Introduction: From Theory to Production
Series Overview: This is Part 19 of our 20-part AI Application Development Mastery series. After 18 parts of building foundational knowledge, frameworks, patterns, and advanced techniques, it is time to put everything together. In this part, we build four complete, production-ready AI applications from scratch.
1. Foundations & Evolution of AI Apps: Pre-LLM era, transformers, LLM revolution
2. LLM Fundamentals for Developers: Tokens, context windows, sampling, API patterns
3. Prompt Engineering Mastery: Zero/few-shot, CoT, ReAct, structured outputs
4. LangChain Core Concepts: Chains, prompts, LLMs, tools, LCEL
5. Retrieval-Augmented Generation (RAG): Embeddings, vector DBs, retrievers, RAG pipelines
6. Memory & Context Engineering: Buffer/summary/vector memory, chunking, re-ranking
7. Agents — Core of Modern AI Apps: ReAct, tool-calling, planner-executor agents
8. LangGraph — Stateful Agent Workflows: Nodes, edges, state, graph execution, cycles
9. Deep Agents & Autonomous Systems: Multi-step reasoning, self-reflection, planning
10. Multi-Agent Systems: Supervisor, swarm, debate, role-based collaboration
11. AI Application Design Patterns: RAG, chat+memory, workflow automation, agent loops
12. Ecosystem & Frameworks: LlamaIndex, Haystack, HuggingFace, vLLM
13. MCP Foundations & Architecture: Protocol design, Host/Client/Server, primitives, security
14. MCP in Production: Building servers, integrations, scaling, agent systems
15. Evaluation & LLMOps: Prompt eval, tracing, LangSmith, experiment tracking
16. Production AI Systems: APIs, queues, caching, streaming, scaling
17. Safety, Guardrails & Reliability: Input filtering, hallucination mitigation, prompt injection
18. Advanced Topics: Fine-tuning, tool learning, hybrid LLM+symbolic
19. Building Real AI Applications: Chatbot, document QA, coding assistant, full-stack (You Are Here)
20. Future of AI Applications: Autonomous agents, self-improving, multi-modal, AI OS
Reading about AI application architecture is one thing. Building production-ready applications is something else entirely. Every project in this part involves real decisions: which model to use, how to structure memory, how to handle failures, how to stream responses, how to manage costs, and how to deploy reliably.
Each project builds on concepts from earlier parts of this series. You will see RAG from Part 5, memory from Part 6, agents from Part 7, LangGraph from Part 8, design patterns from Part 11, production systems from Part 16, and safety from Part 17 all come together in working code.
Key Insight: The four projects are ordered by complexity. Start with the chatbot (simplest) and work your way to the research agent (most complex). Each project introduces new patterns that subsequent projects build upon. By the end, you will have a portfolio of real AI applications.
| Project | Core Concepts | Difficulty |
| --- | --- | --- |
| Chatbot with Memory | LangChain, Redis memory, streaming, session management | Intermediate |
| Document QA (RAG + FAISS) | Document loading, chunking, FAISS, re-ranking, citation | Intermediate-Advanced |
| AI Coding Assistant | Codebase indexing, AST parsing, code-aware RAG, agents | Advanced |
| Research Agent | LangGraph, web search, multi-step reasoning, report gen | Advanced |
1. Project 1: Chatbot with Memory
Our first project is a conversational AI chatbot with persistent memory — it remembers previous conversations across sessions, maintains user preferences, and provides contextually relevant responses. This is the foundation of every AI assistant product.
1.1 Architecture & Memory Design
The architecture uses Redis for persistent conversation storage, LangChain for LLM orchestration and memory management, and FastAPI for the API layer with Server-Sent Events (SSE) streaming. Each user gets a unique session key, and Redis stores both the conversation history and user preferences as serialized JSON — enabling the chatbot to resume context across sessions.
# Project 1: Chatbot with Persistent Memory
# Architecture: LangChain + Redis + FastAPI + Streaming
# pip install langchain-openai langchain-community langchain-core redis
import os
from datetime import datetime
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.runnables.history import RunnableWithMessageHistory
from langchain_community.chat_message_histories import RedisChatMessageHistory
from langchain_core.output_parsers import StrOutputParser
# Ensure OPENAI_API_KEY is set: export OPENAI_API_KEY="your-key-here"
class SmartChatbot:
"""Production chatbot with multi-layer memory."""
def __init__(self, redis_url: str = "redis://localhost:6379"):
self.redis_url = redis_url
self.llm = ChatOpenAI(
model="gpt-4",
temperature=0.7,
streaming=True
)
# System prompt with personality and instructions
self.prompt = ChatPromptTemplate.from_messages([
("system",
"You are a helpful, knowledgeable AI assistant. "
"You remember previous conversations and user preferences. "
"Be concise but thorough. If you reference something from "
"a previous conversation, mention that you remember it. "
"Current date: {current_date}"),
MessagesPlaceholder(variable_name="history"),
("human", "{input}")
])
# Build the chain
self.chain = self.prompt | self.llm | StrOutputParser()
# Wrap with message history
self.chain_with_history = RunnableWithMessageHistory(
self.chain,
self._get_session_history,
input_messages_key="input",
history_messages_key="history"
)
def _get_session_history(self, session_id: str):
"""Get or create Redis-backed chat history for a session."""
return RedisChatMessageHistory(
session_id=session_id,
url=self.redis_url,
ttl=86400 * 30 # 30-day TTL for conversations
)
async def chat(self, message: str, session_id: str):
"""Send a message and get a streaming response."""
config = {"configurable": {"session_id": session_id}}
async for chunk in self.chain_with_history.astream(
{"input": message, "current_date": datetime.now().isoformat()},
config=config
):
yield chunk
def get_history(self, session_id: str) -> list:
"""Retrieve conversation history for a session."""
history = self._get_session_history(session_id)
return history.messages
1.2 Full Implementation
The full implementation exposes the chatbot through FastAPI endpoints: /chat streams tokens in real time via SSE, and /history/{session_id} returns a session's stored conversation. The server routes each message through the LangChain chain, while RedisChatMessageHistory persists conversation state after every exchange.
# FastAPI server for the chatbot
# pip install fastapi uvicorn
# Requires: SmartChatbot class from above
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel
import json
app = FastAPI(title="Smart Chatbot API")
chatbot = SmartChatbot()
class ChatRequest(BaseModel):
message: str
session_id: str
@app.post("/chat")
async def chat_endpoint(request: ChatRequest):
"""Streaming chat endpoint."""
async def generate():
async for chunk in chatbot.chat(
request.message, request.session_id
):
yield f"data: {json.dumps({'content': chunk})}\n\n"
yield "data: [DONE]\n\n"
return StreamingResponse(
generate(),
media_type="text/event-stream"
)
@app.get("/history/{session_id}")
async def get_history(session_id: str):
"""Get conversation history for a session."""
messages = chatbot.get_history(session_id)
return {
"session_id": session_id,
"messages": [
{"role": m.type, "content": m.content}
for m in messages
]
}
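On the client side, the SSE frames the /chat endpoint emits (`data: {...}` blocks separated by blank lines, ending in a `[DONE]` sentinel) can be reassembled with plain string handling. A minimal sketch of the parsing logic, independent of any HTTP client library:

```python
import json

def parse_sse_stream(raw: str) -> str:
    """Reassemble content chunks from an SSE stream shaped like the
    /chat endpoint's output. Stops at the [DONE] sentinel."""
    content = []
    for frame in raw.split("\n\n"):
        if not frame.startswith("data: "):
            continue
        payload = frame[len("data: "):]
        if payload == "[DONE]":
            break
        content.append(json.loads(payload)["content"])
    return "".join(content)

# Example: two content frames followed by the done sentinel
stream = (
    'data: {"content": "Hel"}\n\n'
    'data: {"content": "lo!"}\n\n'
    "data: [DONE]\n\n"
)
print(parse_sse_stream(stream))  # Hello!
```

A real client would read frames incrementally off the response body rather than from a complete string, but the framing rules are the same.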
1.3 Persistent Memory with Redis
Memory Architecture: The chatbot uses a three-tier memory system:
- Short-term (buffer): Last 20 messages in the current session, kept in the LLM context window
- Medium-term (Redis): Full conversation history per session, stored in Redis with 30-day TTL
- Long-term (summary): Periodic summarization of old conversations, stored as a compressed context
# Enhanced memory with summarization for long conversations
# Requires: RedisChatMessageHistory from langchain_community
from langchain_core.messages import SystemMessage
from langchain_community.chat_message_histories import RedisChatMessageHistory
class EnhancedMemory:
"""Multi-tier memory with automatic summarization."""
def __init__(self, llm, redis_url, max_messages=20):
self.llm = llm
self.redis_url = redis_url
self.max_messages = max_messages
async def get_context(self, session_id: str) -> list:
"""Get optimized context for the LLM."""
history = RedisChatMessageHistory(
session_id=session_id, url=self.redis_url
)
messages = history.messages
if len(messages) <= self.max_messages:
return messages
# Summarize older messages
old_messages = messages[:-self.max_messages]
recent_messages = messages[-self.max_messages:]
summary = await self.llm.ainvoke(
f"Summarize this conversation concisely, preserving key facts, "
f"user preferences, and important context:\n\n"
+ "\n".join(f"{m.type}: {m.content}" for m in old_messages)
)
return [
SystemMessage(content=f"Previous conversation summary: {summary.content}"),
*recent_messages
]
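The windowing decision inside `get_context` (keep the newest `max_messages` verbatim, send everything older to the summarizer) is just a list split. Isolated here for clarity:

```python
def split_for_summary(messages: list, max_messages: int = 20):
    """Return (old, recent): old messages go to the summarizer,
    recent ones stay verbatim in the context window."""
    if len(messages) <= max_messages:
        return [], messages
    return messages[:-max_messages], messages[-max_messages:]

old, recent = split_for_summary(list(range(25)), max_messages=20)
print(len(old), len(recent))  # 5 20
```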
2. Project 2: Document QA (RAG + FAISS)
The document QA system allows users to upload PDFs, Word documents, or text files, then ask questions about them. It uses RAG with FAISS for fast, accurate retrieval, with re-ranking and citation tracking for production quality.
2.1 Ingestion & Retrieval Pipeline
The ingestion pipeline handles the complete document processing workflow: loading files from multiple formats (PDF, DOCX, TXT), splitting them into semantically meaningful chunks with overlap, generating embeddings, and storing them in a FAISS vector index. At query time, the retrieval pipeline finds the most relevant chunks and feeds them to the LLM as context for answer generation.
# Project 2: Document QA with RAG + FAISS
# pip install langchain-openai langchain-community langchain faiss-cpu pypdf docx2txt
import hashlib
from pathlib import Path
from langchain_community.document_loaders import (
PyPDFLoader, Docx2txtLoader, TextLoader, UnstructuredHTMLLoader
)
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings, ChatOpenAI
from langchain_community.vectorstores import FAISS
from langchain_core.prompts import ChatPromptTemplate
# Ensure OPENAI_API_KEY is set: export OPENAI_API_KEY="your-key-here"
class DocumentQA:
"""Production document QA system with RAG + FAISS."""
def __init__(self, persist_dir: str = "./faiss_index"):
self.persist_dir = persist_dir
self.embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
self.llm = ChatOpenAI(model="gpt-4", temperature=0)
self.text_splitter = RecursiveCharacterTextSplitter(
chunk_size=512,
chunk_overlap=50,
separators=["\n\n", "\n", ". ", " ", ""],
length_function=len
)
self.qa_prompt = ChatPromptTemplate.from_messages([
("system",
"Answer the question based ONLY on the following context. "
"If the context doesn't contain enough information, say so. "
"Always cite which document and section your answer comes from.\n\n"
"Context:\n{context}"),
("human", "{question}")
])
# Load existing index or create new
self.vectorstore = self._load_or_create_index()
def _load_or_create_index(self):
"""Load persisted FAISS index or create new one."""
index_path = Path(self.persist_dir)
if index_path.exists():
return FAISS.load_local(
self.persist_dir, self.embeddings,
allow_dangerous_deserialization=True
)
return None
def ingest(self, file_path: str) -> dict:
"""Ingest a document into the vector store."""
# Select appropriate loader
ext = Path(file_path).suffix.lower()
loaders = {
".pdf": PyPDFLoader,
".docx": Docx2txtLoader,
".txt": TextLoader,
".html": UnstructuredHTMLLoader
}
loader = loaders.get(ext)
if not loader:
raise ValueError(f"Unsupported file type: {ext}")
# Load and split
documents = loader(file_path).load()
chunks = self.text_splitter.split_documents(documents)
# Add metadata
doc_hash = hashlib.md5(file_path.encode()).hexdigest()[:8]
for i, chunk in enumerate(chunks):
chunk.metadata.update({
"source_file": Path(file_path).name,
"chunk_index": i,
"doc_id": doc_hash,
"total_chunks": len(chunks)
})
# Add to vector store
if self.vectorstore is None:
self.vectorstore = FAISS.from_documents(chunks, self.embeddings)
else:
self.vectorstore.add_documents(chunks)
# Persist index
self.vectorstore.save_local(self.persist_dir)
return {
"file": Path(file_path).name,
"chunks": len(chunks),
"status": "ingested"
}
def query(self, question: str, k: int = 5) -> dict:
"""Query the document store with re-ranking."""
if not self.vectorstore:
return {"error": "No documents ingested yet"}
# Retrieve with score
docs_with_scores = self.vectorstore.similarity_search_with_score(
question, k=k * 2 # Over-retrieve for re-ranking
)
# Filter by relevance threshold
relevant_docs = [
(doc, score) for doc, score in docs_with_scores
if score < 0.8 # Lower = more similar in FAISS L2
][:k]
# Build context with citations
context_parts = []
sources = []
for i, (doc, score) in enumerate(relevant_docs):
source = doc.metadata.get("source_file", "Unknown")
chunk_idx = doc.metadata.get("chunk_index", "?")
context_parts.append(
f"[Source {i+1}: {source}, Section {chunk_idx}]\n{doc.page_content}"
)
sources.append({
"source": source,
"chunk": chunk_idx,
"relevance_score": round(1 - score, 3),
"preview": doc.page_content[:200]
})
context = "\n\n".join(context_parts)
# Generate answer
chain = self.qa_prompt | self.llm
answer = chain.invoke({
"context": context,
"question": question
})
return {
"answer": answer.content,
"sources": sources,
"num_sources": len(sources)
}
2.2 FAISS Vector Store
Why FAISS? FAISS (Facebook AI Similarity Search) is the gold standard for in-memory vector search:
- Speed: Searches billions of vectors in milliseconds using GPU acceleration
- Memory: Efficient compression with Product Quantization (PQ) reduces memory by 4-64x
- Flexibility: Supports exact (Flat) and approximate (IVF, HNSW) search indexes
- No server: Runs as a library — no database server to manage
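To ground the "exact (Flat)" case: an IndexFlatL2 search is conceptually a brute-force squared-L2 scan over every stored vector, which FAISS accelerates with SIMD and GPUs. A plain-Python sketch of the same computation, which also shows why lower scores mean more similar (the reason `query()` above filters with `score < 0.8`):

```python
def l2_search(query: list, vectors: list, k: int = 2):
    """Brute-force L2 search: return (distance, index) pairs for the
    k nearest vectors. Lower distance = more similar."""
    def sq_l2(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    scored = sorted((sq_l2(query, v), i) for i, v in enumerate(vectors))
    return scored[:k]

vectors = [[0.0, 0.0], [1.0, 0.0], [0.9, 0.1]]
print(l2_search([1.0, 0.0], vectors, k=2))  # nearest: index 1, distance 0.0
```

Approximate indexes (IVF, HNSW) trade a little recall for dramatically fewer distance computations by restricting this scan to promising regions of the space.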
2.3 Advanced RAG Techniques
Basic vector similarity search often misses relevant documents, especially when queries use different terminology than the source text. Hybrid search combines dense vector retrieval with sparse BM25 keyword matching, then re-ranks the combined results before generation. A cross-encoder is the classic re-ranker; the implementation below substitutes a simpler LLM-based scorer. Either way, re-ranking significantly improves retrieval accuracy for domain-specific document collections.
# Advanced RAG: Hybrid search + re-ranking
# pip install rank-bm25
# Requires: DocumentQA class from above
from langchain.retrievers import EnsembleRetriever
from langchain_community.retrievers import BM25Retriever
class AdvancedDocumentQA(DocumentQA):
"""Enhanced with hybrid search and cross-encoder re-ranking."""
def _build_hybrid_retriever(self):
"""Combine semantic (FAISS) and keyword (BM25) search."""
# Get all documents for BM25
all_docs = list(self.vectorstore.docstore._dict.values())
bm25_retriever = BM25Retriever.from_documents(all_docs)
bm25_retriever.k = 10
faiss_retriever = self.vectorstore.as_retriever(
search_kwargs={"k": 10}
)
# Ensemble: combine both retrievers with weights
return EnsembleRetriever(
retrievers=[bm25_retriever, faiss_retriever],
weights=[0.4, 0.6] # Favor semantic search
)
def _rerank(self, question: str, docs: list) -> list:
"""Re-rank documents using LLM-based scoring."""
scored = []
for doc in docs:
score_resp = self.llm.invoke(
f"Rate relevance 0-10 of this passage to the question.\n"
f"Question: {question}\nPassage: {doc.page_content[:500]}\n"
f"Reply with ONLY a number."
)
try:
score = float(score_resp.content.strip())
except ValueError:
score = 5.0
scored.append((doc, score))
scored.sort(key=lambda x: x[1], reverse=True)
return [doc for doc, _ in scored]
def query_advanced(self, question: str) -> dict:
"""Query with hybrid retrieval and re-ranking."""
retriever = self._build_hybrid_retriever()
docs = retriever.invoke(question)
# LLM-based re-ranking for precision
reranked = self._rerank(question, docs)
context = "\n\n".join(
f"[{d.metadata.get('source_file')}]: {d.page_content}"
for d in reranked[:5]
)
chain = self.qa_prompt | self.llm
answer = chain.invoke({"context": context, "question": question})
return {"answer": answer.content, "sources": reranked[:5]}
3. Project 3: AI Coding Assistant
The coding assistant understands your entire codebase through code-aware RAG, can answer questions about architecture, generate code that follows your patterns, and execute code in a sandboxed environment. Think of it as a local Copilot that knows your project intimately.
3.1 Codebase RAG & Indexing
The codebase RAG system indexes an entire repository by walking the file tree, chunking source files by logical boundaries (functions, classes), and storing them in a FAISS vector index with metadata (file path, language, line numbers). At query time, it retrieves the most relevant code snippets and feeds them to the LLM alongside the developer’s question for context-aware code assistance.
# Project 3: AI Coding Assistant with Codebase RAG
# pip install langchain-openai langchain-community langchain faiss-cpu
import ast
import os
from pathlib import Path
from langchain.text_splitter import Language, RecursiveCharacterTextSplitter
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
# Ensure OPENAI_API_KEY is set: export OPENAI_API_KEY="your-key-here"
class CodebaseIndexer:
"""Index a codebase for semantic search with AST awareness."""
LANGUAGE_MAP = {
".py": Language.PYTHON,
".js": Language.JS,
".ts": Language.TS,
".java": Language.JAVA,
".go": Language.GO,
".rs": Language.RUST,
}
def __init__(self):
self.embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
def index_codebase(self, root_dir: str, ignore_patterns=None) -> FAISS:
"""Index an entire codebase with language-aware chunking."""
ignore_patterns = ignore_patterns or [
"node_modules", ".git", "__pycache__", ".venv",
"venv", "dist", "build", ".next"
]
all_chunks = []
for path in Path(root_dir).rglob("*"):
# Skip ignored directories
if any(p in str(path) for p in ignore_patterns):
continue
ext = path.suffix.lower()
if ext not in self.LANGUAGE_MAP:
continue
try:
content = path.read_text(encoding="utf-8")
except (UnicodeDecodeError, PermissionError):
continue
# Language-aware splitting
language = self.LANGUAGE_MAP[ext]
splitter = RecursiveCharacterTextSplitter.from_language(
language=language,
chunk_size=1000,
chunk_overlap=100
)
chunks = splitter.create_documents(
[content],
metadatas=[{
"source": str(path.relative_to(root_dir)),
"language": ext,
"functions": self._extract_functions(content, ext)
}]
)
all_chunks.extend(chunks)
# Build FAISS index
vectorstore = FAISS.from_documents(all_chunks, self.embeddings)
return vectorstore
def _extract_functions(self, code: str, ext: str) -> str:
"""Extract function/class names for metadata."""
if ext != ".py":
return ""
try:
tree = ast.parse(code)
names = []
for node in ast.walk(tree):
if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
names.append(f"def {node.name}")
elif isinstance(node, ast.ClassDef):
names.append(f"class {node.name}")
return ", ".join(names)
except SyntaxError:
return ""
3.2 Agent with Code Execution
The coding agent extends the RAG system with tool use — it can not only retrieve and explain code, but also execute code in a sandboxed environment, run tests, and iterate on solutions. The agent uses a ReAct loop: it reasons about the task, decides which tool to use (search codebase, execute code, run tests), observes the result, and continues until the task is complete.
# Coding assistant agent with code execution
# pip install langchain-openai langchain
# Requires: vectorstore from CodebaseIndexer above
import os
import subprocess
import tempfile
from pathlib import Path
from langchain_openai import ChatOpenAI
from langchain_core.tools import tool
from langchain.agents import AgentExecutor, create_tool_calling_agent
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
@tool
def search_codebase(query: str) -> str:
"""Search the indexed codebase for relevant code snippets."""
docs = vectorstore.similarity_search(query, k=5)
results = []
for doc in docs:
results.append(
f"File: {doc.metadata['source']}\n"
f"Functions: {doc.metadata.get('functions', 'N/A')}\n"
f"Code:\n{doc.page_content}\n"
)
return "\n---\n".join(results)
@tool
def execute_python(code: str) -> str:
    """Execute Python code in a subprocess with a 30s timeout.
    Note: a subprocess plus a timeout is process isolation, not a full
    sandbox; use containers or gVisor for untrusted code."""
    import sys
    # Write the code to a temp file; close it before running so the
    # subprocess can read it on all platforms.
    with tempfile.NamedTemporaryFile(mode="w", suffix=".py", delete=False) as f:
        f.write(code)
    try:
        result = subprocess.run(
            [sys.executable, f.name],  # run with the current interpreter
            capture_output=True, text=True, timeout=30,
            env={**os.environ, "PYTHONDONTWRITEBYTECODE": "1"}
        )
        output = result.stdout
        if result.stderr:
            output += f"\nSTDERR: {result.stderr}"
        return output or "Code executed successfully (no output)"
    except subprocess.TimeoutExpired:
        return "ERROR: Code execution timed out (30s limit)"
    finally:
        os.unlink(f.name)
@tool
def read_file(file_path: str) -> str:
"""Read the contents of a file in the codebase."""
try:
return Path(file_path).read_text(encoding="utf-8")[:5000]
except Exception as e:
return f"Error reading file: {e}"
# Build the coding assistant agent
coding_prompt = ChatPromptTemplate.from_messages([
("system",
"You are an expert AI coding assistant that understands the user's "
"codebase deeply. You can:\n"
"1. Search the codebase for relevant code\n"
"2. Read specific files\n"
"3. Execute Python code to test solutions\n\n"
"Always search the codebase first to understand existing patterns "
"before generating new code. Follow the project's coding style."),
MessagesPlaceholder(variable_name="chat_history"),
("human", "{input}"),
MessagesPlaceholder(variable_name="agent_scratchpad")
])
llm = ChatOpenAI(model="gpt-4", temperature=0)
tools = [search_codebase, execute_python, read_file]
agent = create_tool_calling_agent(llm, tools, coding_prompt)
coding_assistant = AgentExecutor(agent=agent, tools=tools, verbose=True)
3.3 Context-Aware Code Generation
Code Generation Best Practices:
- Always retrieve first: Search the codebase for existing patterns before generating new code
- Include imports: Analyze the project's import style and dependency versions
- Match style: Copy naming conventions, docstring format, and error handling patterns from the existing codebase
- Test generation: When generating a function, also generate unit tests that follow the project's testing framework
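The "retrieve first, then generate" practice ultimately comes down to prompt assembly: splice retrieved exemplars into the generation prompt so the model imitates the project's own conventions. A minimal, framework-free sketch (the template wording is illustrative, not from the original):

```python
def build_generation_prompt(task: str, retrieved_snippets: list) -> str:
    """Assemble a code-generation prompt that shows the model existing
    project patterns before stating the task."""
    examples = "\n\n".join(
        f"# From {path}\n{code}" for path, code in retrieved_snippets
    )
    return (
        "Follow the conventions shown in these existing project files:\n\n"
        f"{examples}\n\n"
        f"Task: {task}\n"
        "Generate code matching the style above, including tests."
    )

prompt = build_generation_prompt(
    "add a retry decorator",
    [("utils/http.py", "def fetch(url: str) -> bytes: ...")],
)
print(prompt.splitlines()[0])
```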
4. Project 4: Research Agent
The research agent is the most complex project. It performs multi-step research: searches the web, reads articles, extracts key findings, synthesizes information from multiple sources, identifies gaps, and generates a comprehensive research report — all orchestrated by a LangGraph state machine.
4.1 LangGraph Workflow Design
The research agent uses a LangGraph state machine with specialized nodes: a planner that decomposes the research question into sub-queries, a researcher that executes web searches and extracts key findings, a synthesizer that combines findings into a coherent analysis, and a reviewer that checks for completeness and triggers additional research if needed.
# Project 4: Research Agent with LangGraph
# pip install langgraph langchain-openai langchain-community tavily-python
import json
import operator
from typing import TypedDict, Annotated, List
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.memory import MemorySaver
from langchain_openai import ChatOpenAI
# Ensure these env vars are set:
# export OPENAI_API_KEY="your-key-here"
# export TAVILY_API_KEY="your-tavily-key-here"
class ResearchState(TypedDict):
"""State for the research agent workflow."""
topic: str
search_queries: List[str]
search_results: Annotated[List[dict], operator.add]
summaries: Annotated[List[str], operator.add]
gaps: List[str]
final_report: str
iteration: int
max_iterations: int
class ResearchAgent:
"""Multi-step research agent powered by LangGraph."""
def __init__(self):
self.llm = ChatOpenAI(model="gpt-4", temperature=0)
self.graph = self._build_graph()
def _build_graph(self) -> StateGraph:
"""Build the LangGraph research workflow."""
workflow = StateGraph(ResearchState)
# Add nodes
workflow.add_node("generate_queries", self.generate_queries)
workflow.add_node("web_search", self.web_search)
workflow.add_node("summarize_sources", self.summarize_sources)
workflow.add_node("identify_gaps", self.identify_gaps)
workflow.add_node("generate_report", self.generate_report)
# Define edges
workflow.set_entry_point("generate_queries")
workflow.add_edge("generate_queries", "web_search")
workflow.add_edge("web_search", "summarize_sources")
workflow.add_edge("summarize_sources", "identify_gaps")
# Conditional: loop back if gaps found and under iteration limit
workflow.add_conditional_edges(
"identify_gaps",
self._should_continue,
{"continue": "generate_queries", "finish": "generate_report"}
)
workflow.add_edge("generate_report", END)
return workflow.compile(checkpointer=MemorySaver())
def _should_continue(self, state: ResearchState) -> str:
"""Decide whether to do another research iteration."""
if state["iteration"] >= state["max_iterations"]:
return "finish"
if not state.get("gaps") or len(state["gaps"]) == 0:
return "finish"
return "continue"
async def generate_queries(self, state: ResearchState) -> dict:
"""Generate search queries based on topic and gaps."""
gaps_context = ""
if state.get("gaps"):
gaps_context = f"\nKnowledge gaps to fill: {', '.join(state['gaps'])}"
response = await self.llm.ainvoke(
f"Generate 3-5 diverse search queries to research: "
f"{state['topic']}{gaps_context}\n"
f"Return as a JSON array of strings."
)
queries = json.loads(response.content)
return {"search_queries": queries, "iteration": state["iteration"] + 1}
async def web_search(self, state: ResearchState) -> dict:
"""Execute web searches and collect results."""
from langchain_community.tools.tavily_search import TavilySearchResults
search_tool = TavilySearchResults(max_results=3)
results = []
for query in state["search_queries"]:
search_results = await search_tool.ainvoke(query)
results.extend(search_results)
return {"search_results": results}
async def summarize_sources(self, state: ResearchState) -> dict:
"""Summarize each search result."""
summaries = []
for result in state["search_results"][-10:]: # Last 10 results
summary = await self.llm.ainvoke(
f"Summarize the key findings from this source "
f"relevant to '{state['topic']}':\n\n"
f"Title: {result.get('title', 'N/A')}\n"
f"Content: {result.get('content', '')[:2000]}"
)
summaries.append(summary.content)
return {"summaries": summaries}
async def identify_gaps(self, state: ResearchState) -> dict:
"""Identify knowledge gaps in current research."""
all_summaries = "\n\n".join(state["summaries"])
response = await self.llm.ainvoke(
f"Given this research on '{state['topic']}':\n\n"
f"{all_summaries}\n\n"
f"Identify 0-3 important knowledge gaps that need more research. "
f"If the research is comprehensive, return an empty list. "
f"Return as a JSON array of strings."
)
gaps = json.loads(response.content)
return {"gaps": gaps}
async def generate_report(self, state: ResearchState) -> dict:
"""Generate the final research report."""
all_summaries = "\n\n".join(state["summaries"])
response = await self.llm.ainvoke(
f"Write a comprehensive research report on '{state['topic']}' "
f"based on these findings:\n\n{all_summaries}\n\n"
f"Structure: Executive Summary, Key Findings (numbered), "
f"Analysis, Conclusions, and Sources."
)
return {"final_report": response.content}
async def research(self, topic: str, max_iterations: int = 3) -> dict:
"""Run the full research workflow."""
initial_state = {
"topic": topic,
"search_queries": [],
"search_results": [],
"summaries": [],
"gaps": [],
"final_report": "",
"iteration": 0,
"max_iterations": max_iterations
}
config = {"configurable": {"thread_id": topic[:50]}}
result = await self.graph.ainvoke(initial_state, config=config)
return {
"topic": topic,
"report": result["final_report"],
"sources_count": len(result["search_results"]),
"iterations": result["iteration"]
}
4.2 Web Search & Summarization
Research Agent Design Principles:
- Iterative deepening: Start broad, identify gaps, search deeper. Each iteration refines understanding.
- Source diversity: Generate diverse search queries to avoid echo chamber effects.
- Citation tracking: Every claim in the final report should trace back to a specific source.
- Bounded iterations: Cap at 2-3 iterations to control cost. Diminishing returns after 3 rounds.
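Stripped of LangGraph, the bounded iterative-deepening loop these principles describe is a simple control structure. A hedged sketch where `search` and `find_gaps` are stand-ins for the graph's web-search and gap-analysis nodes:

```python
def research_loop(topic: str, search, find_gaps, max_iterations: int = 3):
    """Iterate: search each open gap, collect findings, and stop when
    no gaps remain or the iteration budget is spent."""
    findings, gaps, iteration = [], [topic], 0
    while gaps and iteration < max_iterations:
        iteration += 1
        for query in gaps:
            findings.extend(search(query))
        gaps = find_gaps(findings)
    return findings, iteration

# Toy stand-ins: one gap surfaces on the first pass, none afterwards.
search = lambda q: [f"finding about {q}"]
find_gaps = lambda fs: ["pricing data"] if len(fs) < 2 else []
findings, rounds = research_loop("vector databases", search, find_gaps)
print(rounds)  # 2
```

The `max_iterations` cap is what keeps cost bounded even when `find_gaps` keeps returning new questions.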
4.3 Report Generation Pipeline
The report generation pipeline transforms raw research findings into a structured, professional document. It uses structured output (Pydantic models) to enforce a consistent report format with sections, key findings, evidence citations, and confidence levels — then renders the structured data into Markdown with proper formatting.
# Enhanced report generation with structured output
# Requires: llm = ChatOpenAI instance from above
from pydantic import BaseModel, Field
from typing import List
class ResearchReport(BaseModel):
"""Structured research report output."""
executive_summary: str = Field(description="2-3 sentence overview")
key_findings: List[str] = Field(description="Numbered key findings")
analysis: str = Field(description="Detailed analysis")
conclusions: str = Field(description="Final conclusions")
sources: List[dict] = Field(description="List of sources with URLs")
confidence: float = Field(description="Confidence score 0-1")
# Use structured output for reliable report format
structured_llm = llm.with_structured_output(ResearchReport)
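Once the LLM returns a `ResearchReport`, rendering it to Markdown is deterministic string work. A sketch over a plain dict so it runs without pydantic; field names match the model above, and the `title`/`url` keys inside `sources` are an assumed shape:

```python
def render_report_markdown(report: dict) -> str:
    """Render structured report fields into a Markdown document."""
    findings = "\n".join(
        f"{i}. {f}" for i, f in enumerate(report["key_findings"], start=1)
    )
    sources = "\n".join(
        f"- [{s['title']}]({s['url']})" for s in report["sources"]
    )
    return (
        f"# Research Report\n\n"
        f"## Executive Summary\n{report['executive_summary']}\n\n"
        f"## Key Findings\n{findings}\n\n"
        f"## Analysis\n{report['analysis']}\n\n"
        f"## Conclusions\n{report['conclusions']}\n\n"
        f"## Sources\n{sources}\n\n"
        f"_Confidence: {report['confidence']:.0%}_"
    )

report = {
    "executive_summary": "Vector DBs are maturing fast.",
    "key_findings": ["pgvector adoption is growing", "FAISS leads in-memory"],
    "analysis": "Detailed analysis here.",
    "conclusions": "Conclusions here.",
    "sources": [{"title": "pgvector docs", "url": "https://example.com"}],
    "confidence": 0.85,
}
md = render_report_markdown(report)
print(md.splitlines()[0])  # # Research Report
```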
5. Full-Stack Architecture
A production AI application requires much more than just the AI logic. Here is the complete full-stack architecture that ties all four projects together into a deployable system.
5.1 React + FastAPI + LangChain + pgvector + Redis + Docker
| Layer | Technology | Purpose |
| --- | --- | --- |
| Frontend | React + TypeScript + Tailwind CSS | Chat UI, document upload, streaming display |
| API Gateway | FastAPI + Pydantic | REST/WebSocket endpoints, validation, auth |
| AI Orchestration | LangChain + LangGraph | Chains, agents, RAG pipelines, workflows |
| Vector Database | PostgreSQL + pgvector | Persistent vector storage with SQL capabilities |
| Cache & Memory | Redis | Session memory, response cache, rate limiting |
| Container Orchestration | Docker Compose / Kubernetes | Service orchestration, scaling, deployment |
5.2 API Design & Streaming
The production API layer wraps the AI pipeline in FastAPI with proper error handling, request validation, rate limiting, and WebSocket streaming for real-time token delivery. The API design follows RESTful conventions for synchronous endpoints and WebSocket connections for streaming — both backed by Redis for session management and caching.
```python
# Production FastAPI application
# pip install fastapi uvicorn redis websockets
# Requires: SmartChatbot, init_vectorstore from project modules
from contextlib import asynccontextmanager

import redis.asyncio as redis
from fastapi import FastAPI, WebSocket, WebSocketDisconnect
from fastapi.middleware.cors import CORSMiddleware


@asynccontextmanager
async def lifespan(app: FastAPI):
    """Application lifecycle management."""
    # Startup: initialize connections
    app.state.redis = redis.Redis.from_url("redis://redis:6379")
    app.state.vectorstore = await init_vectorstore()
    yield
    # Shutdown: cleanup
    await app.state.redis.close()


app = FastAPI(title="AI Application Platform", lifespan=lifespan)
app.add_middleware(
    CORSMiddleware,
    allow_origins=["http://localhost:3000"],
    allow_methods=["*"],
    allow_headers=["*"],
)


# WebSocket for real-time streaming
@app.websocket("/ws/chat/{session_id}")
async def websocket_chat(websocket: WebSocket, session_id: str):
    await websocket.accept()
    chatbot = SmartChatbot()
    try:
        while True:
            message = await websocket.receive_text()
            async for chunk in chatbot.chat(message, session_id):
                await websocket.send_json({"type": "chunk", "content": chunk})
            await websocket.send_json({"type": "done"})
    except WebSocketDisconnect:
        pass  # Client closed the connection; don't try to close it again
    except Exception:
        await websocket.close(code=1011)  # Internal error
```
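To exercise the streaming endpoint, a minimal client reads frames until the `done` marker arrives. This is a sketch, not part of the project code: it assumes the third-party `websockets` package and a server running on localhost:8000, and the names `render_frame` and `chat_once` are introduced here for illustration.

```python
# Minimal streaming client sketch for the /ws/chat endpoint above.
# pip install websockets
import asyncio
import json


def render_frame(raw: str):
    """Extract the text from one server frame; return None when the stream is done."""
    frame = json.loads(raw)
    if frame.get("type") == "chunk":
        return frame.get("content", "")
    return None  # "done" (or unknown) frame ends the stream


async def chat_once(uri: str, message: str) -> str:
    import websockets  # third-party; assumed installed

    async with websockets.connect(uri) as ws:
        await ws.send(message)
        parts = []
        async for raw in ws:
            text = render_frame(raw)
            if text is None:
                break
            parts.append(text)
        return "".join(parts)

# Usage (with the server running):
#   reply = asyncio.run(chat_once("ws://localhost:8000/ws/chat/demo", "Hello"))
```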
5.3 Docker Compose Deployment
Docker Compose orchestrates the complete application stack: the React frontend, the FastAPI application server, Redis for caching and session storage, and Postgres with pgvector for persistent vector storage. The configuration below uses environment variable injection for secrets and named volumes for data persistence; for full production readiness you would also add health checks, resource limits, and a monitoring stack (Prometheus and Grafana).
```yaml
# docker-compose.yml
version: "3.9"

services:
  frontend:
    build: ./frontend
    ports:
      - "3000:3000"
    environment:
      - REACT_APP_API_URL=http://localhost:8000
    depends_on:
      - api

  api:
    build: ./backend
    ports:
      - "8000:8000"
    environment:
      - OPENAI_API_KEY=${OPENAI_API_KEY}
      - REDIS_URL=redis://redis:6379
      - DATABASE_URL=postgresql://ai:secret@postgres:5432/aiapp
    depends_on:
      - redis
      - postgres
    volumes:
      - ./data:/app/data  # Persist FAISS indexes

  redis:
    image: redis:7-alpine
    ports:
      - "6379:6379"
    volumes:
      - redis_data:/data

  postgres:
    image: pgvector/pgvector:pg16
    environment:
      POSTGRES_DB: aiapp
      POSTGRES_USER: ai
      POSTGRES_PASSWORD: secret
    ports:
      - "5432:5432"
    volumes:
      - pg_data:/var/lib/postgresql/data

volumes:
  redis_data:
  pg_data:
```
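Health checks and resource limits are worth adding per service. A sketch for the `api` service is below; it assumes `curl` is available in the image and a `/health` endpoint exists in the FastAPI app, and note that `deploy.resources` limits are honored by Swarm and recent Compose versions but ignored by older ones.

```yaml
  api:
    # ...build, ports, environment as above...
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 5s
      retries: 3
    deploy:
      resources:
        limits:
          cpus: "2.0"
          memory: 2G
```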
6. Deployment Patterns
Deploying AI applications requires careful consideration of infrastructure choices across the three major cloud providers, scaling strategies for handling variable LLM latency, and monitoring patterns specific to AI workloads. This section covers cloud service mappings, Kubernetes configurations, and serverless deployment options for each project type.
6.1 Cloud Deployment (AWS/GCP/Azure)
| Component | AWS | GCP | Azure |
|---|---|---|---|
| API Server | ECS Fargate / Lambda | Cloud Run | Azure Container Apps |
| Vector DB | RDS + pgvector / OpenSearch | AlloyDB + pgvector | Azure PostgreSQL + pgvector |
| Cache | ElastiCache (Redis) | Memorystore | Azure Cache for Redis |
| Frontend | CloudFront + S3 | Cloud CDN + GCS | Azure CDN + Blob Storage |
| Queue | SQS | Pub/Sub | Service Bus |
6.2 Scaling Strategies
Scaling AI Applications:
- Horizontal scaling: Run multiple API server replicas behind a load balancer. Each replica is stateless (state in Redis/Postgres).
- Queue-based processing: For long-running tasks (research agent, document ingestion), use a job queue (Celery, Bull) to process asynchronously.
- Caching: Cache LLM responses for identical queries. A semantic cache can match similar (not just identical) queries.
- Connection pooling: Pool LLM API connections to avoid rate limits and reduce latency.
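The queue-based pattern can be sketched in-process with asyncio: producers enqueue jobs, a pool of workers drains the queue. In production the queue would be Celery or Bull backed by Redis so jobs survive restarts, but the shape is the same. Job names like `ingest:report.pdf` are illustrative only.

```python
# In-process sketch of queue-based processing for long-running AI tasks.
import asyncio


async def worker(name: str, queue: asyncio.Queue, results: list) -> None:
    """Drain jobs from the queue until cancelled."""
    while True:
        job = await queue.get()
        # A real worker would do slow work here (LLM calls, PDF parsing, ...)
        results.append(f"{name} processed {job}")
        queue.task_done()


async def run_jobs(jobs: list, n_workers: int = 2) -> list:
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    workers = [asyncio.create_task(worker(f"w{i}", queue, results))
               for i in range(n_workers)]
    for job in jobs:
        await queue.put(job)
    await queue.join()   # Wait until every job has been marked done
    for w in workers:
        w.cancel()       # Workers loop forever; stop them explicitly
    return results


if __name__ == "__main__":
    done = asyncio.run(run_jobs(["ingest:report.pdf", "research:agent memory"]))
    print(done)
```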
7. Cost Estimation
Understanding the per-query cost of each AI application is critical for pricing decisions, budget planning, and optimization prioritization. Costs vary dramatically across project types — a simple chatbot query costs fractions of a cent, while a multi-step research agent query can cost 10-50x more due to multiple LLM calls and tool invocations.
7.1 Per-Query Cost Breakdown
| Component | Chatbot | Document QA | Coding Assistant | Research Agent |
|---|---|---|---|---|
| LLM (input) | $0.006 | $0.012 | $0.018 | $0.060 |
| LLM (output) | $0.012 | $0.018 | $0.024 | $0.090 |
| Embeddings | $0.000 | $0.0001 | $0.0002 | $0.0003 |
| Web search API | $0.000 | $0.000 | $0.000 | $0.015 |
| Total per query | ~$0.018 | ~$0.030 | ~$0.042 | ~$0.165 |
| Monthly (1K queries/day) | ~$540 | ~$900 | ~$1,260 | ~$4,950 |
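The figures above follow directly from token counts and per-token prices. The sketch below uses GPT-4-era list prices ($0.03 per 1K input tokens, $0.06 per 1K output tokens) and assumed token counts (roughly 200 each way for the chatbot) chosen to reproduce the table; your actual counts and current prices will differ.

```python
# Per-query cost from token counts. Prices and token counts are assumptions
# that reproduce the table above under GPT-4-era list pricing.
PRICE_IN = 0.03 / 1000   # $ per input token
PRICE_OUT = 0.06 / 1000  # $ per output token


def query_cost(input_tokens: int, output_tokens: int, extras: float = 0.0) -> float:
    """Cost of one query: LLM input + output plus embedding/search extras."""
    return input_tokens * PRICE_IN + output_tokens * PRICE_OUT + extras


chatbot = query_cost(200, 200)   # ≈ $0.018 per query
monthly = chatbot * 1000 * 30    # 1K queries/day ≈ $540/month
print(f"per query: ${chatbot:.3f}, monthly: ${monthly:.0f}")
```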
7.2 Cost Optimization Tips
The biggest cost savings come from caching (storing and reusing responses), model routing (sending simple queries to cheaper models), and prompt compression (reducing token count while preserving meaning). The implementation below sketches all three strategies with Redis-backed exact-match caching; a true semantic cache would additionally embed queries and match on a similarity threshold, so that similar (not just identical) queries hit the cache.
```python
# Cost optimization strategies
# pip install langchain-openai redis
import hashlib
import json

import redis.asyncio as redis
from langchain_openai import ChatOpenAI


class CostOptimizer:
    """Strategies to reduce AI application costs."""

    def __init__(self):
        self.llm = ChatOpenAI(model="gpt-4")
        self.small_llm = ChatOpenAI(model="gpt-3.5-turbo")

    # 1. Response caching — avoid redundant LLM calls for repeated queries
    async def cached_query(self, query: str, cache: redis.Redis) -> str:
        cache_key = f"llm:{hashlib.md5(query.encode()).hexdigest()}"
        cached = await cache.get(cache_key)
        if cached:
            return json.loads(cached)  # Cache hit: free!
        result = await self.llm.ainvoke(query)
        # Store the message text — AIMessage objects are not JSON-serializable
        await cache.set(cache_key, json.dumps(result.content), ex=3600)
        return result.content

    # 2. Model routing — use cheaper models for simple queries
    def route_to_model(self, query: str, complexity: str) -> ChatOpenAI:
        if complexity == "simple":
            return self.small_llm  # roughly 10x cheaper than GPT-4
        return self.llm

    # 3. Prompt compression — reduce token count
    def compress_context(self, context: str, max_tokens: int = 1000) -> str:
        """Summarize context to fit within a token budget."""
        # Rough check: word count as a cheap proxy for token count
        if len(context.split()) < max_tokens:
            return context
        # Use a small model to compress
        compressed = self.small_llm.invoke(
            f"Compress this to under {max_tokens} tokens, "
            f"keeping all key facts:\n{context}"
        )
        return compressed.content
```
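A semantic cache matches similar rather than identical queries by comparing embeddings against a similarity threshold. The minimal in-memory sketch below uses a toy letter-frequency `toy_embed` as a stand-in for a real embedding model; a production version would embed with a model like OpenAI's text-embedding-3-small and store vectors in Redis or pgvector. All names here are illustrative.

```python
# Semantic cache sketch: cosine similarity over stored query embeddings.
import math


def toy_embed(text: str) -> list:
    """Toy letter-frequency embedding — stands in for a real embedding model."""
    text = text.lower()
    return [float(text.count(c)) for c in "abcdefghijklmnopqrstuvwxyz"]


def cosine(a: list, b: list) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


class SemanticCache:
    """Return a cached response when a query is similar enough to a stored one."""

    def __init__(self, embed, threshold: float = 0.9):
        self.embed = embed          # embedding function
        self.threshold = threshold  # minimum cosine similarity for a hit
        self.entries = []           # list of (embedding, response) pairs

    def get(self, query: str):
        qv = self.embed(query)
        best, best_sim = None, 0.0
        for vec, resp in self.entries:
            sim = cosine(qv, vec)
            if sim > best_sim:
                best, best_sim = resp, sim
        return best if best_sim >= self.threshold else None

    def put(self, query: str, response: str) -> None:
        self.entries.append((self.embed(query), response))
```

With a real embedding model, paraphrases like "how do I reset my password" and "password reset steps" would land above the threshold and share one cached response.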
Exercises & Self-Assessment
Exercise 1
Build the Chatbot
- Implement the full chatbot with Redis memory. Test that conversations persist across server restarts.
- Add user authentication (session tokens) so different users have separate conversation histories.
- Implement the three-tier memory system (buffer, Redis, summary). Test with a 100+ message conversation.
Exercise 2
Build the Document QA
- Implement the document QA system. Ingest 3 different PDF documents and test with 20 questions spanning all documents.
- Compare basic FAISS retrieval vs. hybrid (FAISS + BM25) retrieval. Measure precision at k=5 for 10 queries.
- Add citation tracking: every sentence in the answer should reference a specific source document and page.
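For the retrieval comparison in Exercise 2, precision at k is simply the fraction of the top-k retrieved chunks that are actually relevant. A helper like this (names are illustrative) makes the measurement concrete:

```python
def precision_at_k(retrieved_ids: list, relevant_ids: set, k: int = 5) -> float:
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved_ids[:k]
    hits = sum(1 for doc_id in top_k if doc_id in relevant_ids)
    return hits / k


# Example: 3 of the top 5 results are relevant
print(precision_at_k(["d1", "d2", "d9", "d4", "d7"], {"d1", "d2", "d4"}))  # → 0.6
```

Average this over your 10 test queries for each retriever to compare basic FAISS against the hybrid FAISS + BM25 setup.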
Exercise 3
Build the Coding Assistant
- Index an open-source Python project (e.g., Flask, Requests) and ask architectural questions about it.
- Test code generation: ask the assistant to add a new feature that follows the project's existing patterns.
- Implement the sandboxed code execution tool. Test with safe code and verify that dangerous operations are blocked.
Exercise 4
Build the Research Agent
- Implement the full LangGraph research agent. Run it on 3 different topics and evaluate report quality.
- Add a "fact-checking" node that verifies claims against multiple sources before including them in the report.
- Implement report export in multiple formats (Markdown, PDF, DOCX) with proper citations.
Exercise 5
Reflective Questions
- What are the trade-offs between FAISS (in-memory) and pgvector (database-backed) for vector storage? When would you choose each?
- How would you handle document updates in the RAG system? If a PDF is updated, how do you refresh only the changed chunks?
- Compare the cost of running these applications with GPT-4 vs. a locally deployed Llama model. At what scale does self-hosting become cheaper?
- How would you add multi-tenancy to the full-stack architecture? What changes are needed at each layer?
- Design a monitoring and alerting system for the research agent. What metrics would you track? What constitutes a "failed" research?
Conclusion & Next Steps
You have now built four complete, production-ready AI applications — each demonstrating different aspects of the AI application stack. Here are the key takeaways from Part 19:
- Chatbot with Memory — Redis-backed persistent memory with three-tier architecture (buffer, session, summary) creates conversational AI that truly remembers
- Document QA (RAG + FAISS) — Hybrid retrieval (semantic + keyword) with re-ranking and citation tracking produces accurate, verifiable answers from your documents
- AI Coding Assistant — Code-aware RAG with AST parsing, language-specific chunking, and sandboxed execution creates a project-aware development companion
- Research Agent — LangGraph orchestration with iterative deepening, gap identification, and structured report generation automates complex research workflows
- Full-Stack Architecture — React + FastAPI + LangChain + pgvector + Redis + Docker provides a production-grade foundation for any AI application
- Cost management — Semantic caching, model routing, and prompt compression can reduce LLM costs by 50-80%
Next in the Series
In Part 20: Future of AI Applications (FINAL), we look ahead to fully autonomous agents, self-improving systems with DSPy, multi-modal AI, AI-native operating systems, the evolution of MCP, agentic infrastructure at scale, frontier research directions, and the societal implications of AI applications.
Continue the Series
Part 20: Future of AI Applications (FINAL)
Explore autonomous agents, self-improving systems, multi-modal AI, AI-native operating systems, and the future of agentic infrastructure.
Part 1: Foundations & Evolution of AI Apps
Where it all began: from ELIZA to ChatGPT, the transformer revolution, and the modern AI application stack.
Part 5: Retrieval-Augmented Generation (RAG)
The core pattern powering most production AI applications: embeddings, vector databases, retrievers, and RAG pipelines.