
AI Application Development Mastery Part 6: Memory & Context Engineering

April 1, 2026 • Wasil Zafar • 40 min read

Give your AI applications the ability to remember. Master buffer, summary, window, vector, and entity memory patterns, implement long-term memory with persistent storage, and learn context engineering techniques — intelligent chunking, re-ranking with Cohere and ColBERT, prompt compression, and context window management strategies that maximize every token.

Table of Contents

  1. Memory Types
  2. Long-Term Memory
  3. Context Engineering
  4. Re-Ranking
  5. Prompt Compression
  6. Context Window Management
  7. Exercises & Self-Assessment
  8. Memory Design Generator
  9. Conclusion & Next Steps

Introduction: The Memory Problem in AI Applications

Series Overview: This is Part 6 of our 20-part AI Application Development Mastery series. We will take you from foundational understanding through prompt engineering, LangChain, RAG systems, agents, LangGraph, multi-agent architectures, production deployment, and building real-world AI applications.

AI Application Development Mastery

Your 20-step learning path • Currently on Step 6
  1. Foundations & Evolution of AI Apps: Pre-LLM era, transformers, LLM revolution
  2. LLM Fundamentals for Developers: Tokens, context windows, sampling, API patterns
  3. Prompt Engineering Mastery: Zero/few-shot, CoT, ReAct, structured outputs
  4. LangChain Core Concepts: Chains, prompts, LLMs, tools, LCEL
  5. Retrieval-Augmented Generation (RAG): Embeddings, vector DBs, retrievers, RAG pipelines
  6. Memory & Context Engineering: Buffer/summary/vector memory, chunking, re-ranking (You Are Here)
  7. Agents — Core of Modern AI Apps: ReAct, tool-calling, planner-executor agents
  8. LangGraph — Stateful Agent Workflows: Nodes, edges, state, graph execution, cycles
  9. Deep Agents & Autonomous Systems: Multi-step reasoning, self-reflection, planning
  10. Multi-Agent Systems: Supervisor, swarm, debate, role-based collaboration
  11. AI Application Design Patterns: RAG, chat+memory, workflow automation, agent loops
  12. Ecosystem & Frameworks: LlamaIndex, Haystack, HuggingFace, vLLM
  13. MCP Foundations & Architecture: Protocol design, Host/Client/Server, primitives, security
  14. MCP in Production: Building servers, integrations, scaling, agent systems
  15. Evaluation & LLMOps: Prompt eval, tracing, LangSmith, experiment tracking
  16. Production AI Systems: APIs, queues, caching, streaming, scaling
  17. Safety, Guardrails & Reliability: Input filtering, hallucination mitigation, prompt injection
  18. Advanced Topics: Fine-tuning, tool learning, hybrid LLM+symbolic
  19. Building Real AI Applications: Chatbot, document QA, coding assistant, full-stack
  20. Future of AI Applications: Autonomous agents, self-improving, multi-modal, AI OS

LLMs are stateless by design. Every API call starts from scratch — the model has no memory of previous interactions. This means that without explicit memory management, your chatbot forgets what the user said two messages ago, your agent cannot learn from its mistakes, and your assistant cannot build a relationship with its user over time.

Memory gives AI applications the ability to maintain state across interactions. Context engineering ensures that the right information reaches the model at the right time, within the constraints of finite context windows. Together, they transform a stateless API call into an intelligent, contextually-aware application.

Key Insight: Context engineering is becoming as important as prompt engineering. As practitioners across the industry have observed, we are moving from "prompt engineering" to "context engineering" — the art of providing the model with exactly the right information (memories, retrieved docs, tool results, system instructions) within a finite token budget.

Topic | What You Will Learn
Memory Types | Buffer, summary, window, vector, and entity memory implementations
Long-Term Memory | Persistent storage backends and memory architecture patterns
Context Engineering | Designing context strategies and intelligent chunking
Re-Ranking | Cohere, ColBERT, and cross-encoder re-ranking for precision
Prompt Compression | LLMLingua and contextual compression to maximize token usage
Context Window Management | Token budgeting and sliding window strategies

1. Memory Types

LangChain provides several memory implementations, each optimized for different use cases. Understanding when to use each type is critical for building effective conversational AI applications.

1.1 Buffer Memory

The simplest memory type — stores the complete conversation history verbatim. Fast and accurate but grows linearly with conversation length:

# pip install langchain-openai langchain-core
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.output_parsers import StrOutputParser
from langchain_core.messages import HumanMessage, AIMessage

# Requires an API key in the environment: export OPENAI_API_KEY="sk-..."

# Buffer memory - stores full conversation as a list of messages
history = []

# Build a conversational chain with memory
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful AI assistant. Use the conversation "
               "history to provide contextual responses."),
    MessagesPlaceholder(variable_name="history"),
    ("human", "{input}"),
])

model = ChatOpenAI(model="gpt-4o", temperature=0.7)

# Manual memory management with LCEL
def chat_with_memory(user_input: str) -> str:
    """Process a message with conversation memory."""
    chain = prompt | model | StrOutputParser()
    response = chain.invoke({
        "history": history,
        "input": user_input,
    })

    # Save to memory
    history.append(HumanMessage(content=user_input))
    history.append(AIMessage(content=response))

    return response

# Multi-turn conversation
print(chat_with_memory("My name is Alex and I'm building a RAG system."))
print(chat_with_memory("What embedding model would you recommend?"))
print(chat_with_memory("What was I building again?"))  # Recalls "RAG system"

1.2 Summary Memory

Summary memory uses an LLM to maintain a running summary of the conversation. It grows slowly regardless of conversation length, making it ideal for long interactions:

# pip install langchain-openai
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, AIMessage, SystemMessage

model = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Summary memory - LLM progressively summarizes the conversation
running_summary = ""

def update_summary(human_input: str, ai_output: str) -> str:
    """Ask the LLM to update the running summary with new messages."""
    global running_summary
    response = model.invoke([
        SystemMessage(content="Progressively summarize the conversation, "
                      "adding to the existing summary with new lines."),
        HumanMessage(content=(
            f"Current summary:\n{running_summary}\n\n"
            f"New lines:\nHuman: {human_input}\nAI: {ai_output}\n\n"
            f"Updated summary:"
        )),
    ])
    running_summary = response.content
    return running_summary

# After many messages, the summary stays concise:
update_summary(
    "I want to build a RAG system for legal documents",
    "Great choice! Legal RAG requires careful attention to..."
)
update_summary(
    "Should I use Pinecone or Qdrant?",
    "For legal documents, I'd recommend Qdrant because..."
)

# The summary is much smaller than the full conversation
print(f"Summary ({len(running_summary)} chars):\n{running_summary}")

1.3 Conversation Window Memory

Window memory keeps only the last N interactions. Simple, predictable token usage, but loses older context:

# pip install langchain-core
from langchain_core.messages import HumanMessage, AIMessage

# Window memory - keep only the last K exchanges
class ConversationWindow:
    def __init__(self, k: int = 10):
        self.k = k
        self.messages = []

    def add(self, human_input: str, ai_output: str):
        self.messages.append(HumanMessage(content=human_input))
        self.messages.append(AIMessage(content=ai_output))
        # Trim to last k exchanges (k * 2 messages)
        if len(self.messages) > self.k * 2:
            self.messages = self.messages[-self.k * 2:]

    def get_history(self) -> list:
        return self.messages

# Keep only the last 10 exchanges
window = ConversationWindow(k=10)
window.add("I'm building a RAG system", "Great! What domain?")
window.add("Legal documents", "Legal RAG needs careful chunking...")
window.add("What about embeddings?", "Use text-embedding-3-small...")

print(f"Window has {len(window.get_history())} messages")
for msg in window.get_history():
    role = "Human" if isinstance(msg, HumanMessage) else "AI"
    print(f"  {role}: {msg.content[:50]}...")

1.4 Vector Store Memory

Vector memory stores conversation turns as embeddings and retrieves the most relevant past interactions using similarity search. This is powerful for long-running conversations where only specific past context is relevant:

# pip install langchain-chroma langchain-openai chromadb
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document

# Create a vector store for conversation memory
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
memory_store = Chroma(
    collection_name="conversation_memory",
    embedding_function=embeddings,
)

# Save conversation turns as documents
memory_store.add_documents([
    Document(page_content="Human: I prefer using PostgreSQL for all my databases\n"
             "AI: Noted! pgvector would be a great choice for you then."),
    Document(page_content="Human: My team uses Python and FastAPI for backends\n"
             "AI: LangChain integrates perfectly with FastAPI..."),
])

# Later, retrieve relevant past conversations by similarity
retriever = memory_store.as_retriever(search_kwargs={"k": 3})
relevant = retriever.invoke("What vector database should I use?")

# Returns the PostgreSQL conversation as most relevant
for doc in relevant:
    print(doc.page_content[:80])

1.5 Entity Memory

Entity memory extracts and maintains information about specific entities (people, projects, technologies) mentioned in conversation. It builds a structured knowledge graph from unstructured dialogue:

# pip install langchain-openai
from langchain_openai import ChatOpenAI
from langchain_core.messages import HumanMessage, SystemMessage
import json

model = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Entity memory - extract and track entities from conversation
entity_store = {}  # {entity_name: description}

def extract_entities(human_input: str, ai_output: str):
    """Use the LLM to extract entities from a conversation turn."""
    response = model.invoke([
        SystemMessage(content=(
            "Extract named entities (people, projects, technologies) from this "
            "conversation turn. Return JSON: {\"entity_name\": \"description\"}. "
            "Return {} if no entities found."
        )),
        HumanMessage(content=f"Human: {human_input}\nAI: {ai_output}"),
    ])
    try:
        new_entities = json.loads(response.content)
        entity_store.update(new_entities)
    except json.JSONDecodeError:
        pass

# After several interactions, entity memory maintains:
extract_entities(
    "Our project Phoenix uses LangGraph for agent orchestration",
    "Project Phoenix sounds interesting! LangGraph is great for..."
)
extract_entities(
    "Sarah leads the ML team and prefers Anthropic models",
    "Got it! Sarah's preference for Anthropic models means..."
)

# Entity store now contains:
# "Phoenix": "A project that uses LangGraph for agent orchestration"
# "Sarah": "Leads the ML team, prefers Anthropic models"
for name, desc in entity_store.items():
    print(f"  {name}: {desc}")
Memory Selection Guide: Use buffer for short conversations (less than 20 turns). Use window for medium conversations with predictable token budgets. Use summary for long conversations where you need general context. Use vector for very long histories where only specific past context matters. Use entity when tracking relationships between people, projects, and concepts is important.
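The selection guide above can be sketched as a small decision helper. This function is purely illustrative (the flag names and the 20-turn threshold are assumptions drawn from the rules of thumb above, not a LangChain API):

```python
def choose_memory_type(
    turns: int,
    predictable_budget: bool = False,
    specific_recall: bool = False,
    tracks_entities: bool = False,
) -> str:
    """Map the rules of thumb above to a memory type."""
    if tracks_entities:
        return "entity"   # relationships between people/projects matter
    if turns < 20:
        return "buffer"   # short conversation: keep everything verbatim
    if specific_recall:
        return "vector"   # very long history, only specific context matters
    if predictable_budget:
        return "window"   # medium conversation with a fixed token budget
    return "summary"      # long conversation needing general context

print(choose_memory_type(turns=10))                           # buffer
print(choose_memory_type(turns=40, predictable_budget=True))  # window
print(choose_memory_type(turns=500, specific_recall=True))    # vector
```

In practice these types also compose: many production systems pair a window or buffer for recent turns with a vector or entity store for everything older.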

2. Long-Term Memory

Session-level memory disappears when the application restarts. Long-term memory persists across sessions, enabling AI applications to build relationships with users over days, weeks, and months.

2.1 Persistent Storage Backends

While in-memory stores work for prototyping, production applications need durable memory that survives restarts. Redis is a popular choice for long-term memory storage because it combines key-value simplicity with features like TTL expiry (automatically forgetting stale memories), sorted sets for temporal ordering, and sub-millisecond latency. The following implementation wraps Redis with a semantic layer that stores user memories, retrieves them by recency, and builds user profiles from accumulated interactions.

# pip install redis
import json
import redis
from datetime import datetime, timezone
from typing import List, Dict, Optional

class RedisLongTermMemory:
    """Production-grade long-term memory using Redis."""

    def __init__(self, redis_url: str = "redis://localhost:6379"):
        self.redis = redis.from_url(redis_url)
        self.ttl_days = 90  # Memories expire after 90 days

    def save_memory(
        self,
        user_id: str,
        memory_type: str,
        content: str,
        metadata: Optional[Dict] = None
    ):
        """Save a memory with metadata and TTL."""
        memory = {
            "content": content,
            "type": memory_type,
            "timestamp": datetime.now(timezone.utc).isoformat(),
            "metadata": metadata or {},
        }
        key = f"memory:{user_id}:{memory_type}"
        self.redis.rpush(key, json.dumps(memory))
        self.redis.expire(key, self.ttl_days * 86400)

    def get_memories(
        self,
        user_id: str,
        memory_type: str,
        limit: int = 20
    ) -> List[Dict]:
        """Retrieve recent memories of a specific type."""
        key = f"memory:{user_id}:{memory_type}"
        raw = self.redis.lrange(key, -limit, -1)
        return [json.loads(m) for m in raw]

    def get_user_profile(self, user_id: str) -> Dict:
        """Build a user profile from accumulated memories."""
        preferences = self.get_memories(user_id, "preference", limit=50)
        facts = self.get_memories(user_id, "fact", limit=50)
        interactions = self.get_memories(user_id, "interaction", limit=10)

        return {
            "preferences": [m["content"] for m in preferences],
            "facts": [m["content"] for m in facts],
            "recent_interactions": [m["content"] for m in interactions],
            "memory_count": len(preferences) + len(facts),
        }

# Usage
ltm = RedisLongTermMemory()
ltm.save_memory("user-123", "preference", "Prefers Python over JavaScript")
ltm.save_memory("user-123", "fact", "Works at Acme Corp as a senior engineer")
ltm.save_memory("user-123", "preference", "Likes detailed code examples")

profile = ltm.get_user_profile("user-123")
# Inject into system prompt for personalized responses
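The closing comment above glosses over the injection step. Here is a minimal sketch of one way to render the profile into a system prompt; the function name and formatting are illustrative assumptions, not part of the Redis wrapper or any library:

```python
def profile_to_system_prompt(profile: dict) -> str:
    """Render a user profile dict into a personalized system prompt."""
    parts = ["You are a helpful AI assistant."]
    if profile.get("preferences"):
        bullets = "\n".join(f"- {p}" for p in profile["preferences"])
        parts.append(f"User preferences:\n{bullets}")
    if profile.get("facts"):
        bullets = "\n".join(f"- {f}" for f in profile["facts"])
        parts.append(f"Known facts about the user:\n{bullets}")
    return "\n\n".join(parts)

profile = {
    "preferences": ["Prefers Python over JavaScript", "Likes detailed code examples"],
    "facts": ["Works at Acme Corp as a senior engineer"],
}
print(profile_to_system_prompt(profile))
```

The resulting string can be passed as the system message of any chat model call, so every response is grounded in the accumulated memories.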

2.2 Memory Architecture Patterns

Sophisticated AI applications often need multiple memory tiers working together — mirroring how human memory operates. A tiered memory system combines working memory (a fast buffer for the current conversation), episodic memory (a vector store for retrieving relevant past experiences), and semantic memory (structured storage for facts and preferences). This architecture lets agents maintain both short-term conversational flow and long-term knowledge accumulation.

Tiered Memory Architecture
graph TD
    subgraph Tiered ["LLM Memory Tiers"]
        WM["Working Memory
Current Conversation"]
        STM["Short-Term Memory
Recent Sessions Buffer"]
        LTM["Long-Term Memory
Persistent Knowledge via Vector DB"]
    end

    WM --> STM
    STM --> LTM
    LTM -.->|Recall| WM

    style Tiered fill:#f8f9fa,stroke:#132440
    style WM fill:#e8f4f4,stroke:#3B9797
    style STM fill:#f0f4f8,stroke:#16476A
    style LTM fill:#132440,stroke:#132440,color:#fff
                        
# pip install langchain-openai langchain-core
from typing import List
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain_core.output_parsers import StrOutputParser

class TieredMemorySystem:
    """
    Three-tier memory architecture:
    - Working memory: Current conversation (buffer, last N messages)
    - Episodic memory: Past conversations (vector store, searchable)
    - Semantic memory: User facts and preferences (structured store)
    """

    def __init__(self, user_id: str):
        self.user_id = user_id
        self.working_memory = []         # Current session messages
        self.max_working = 20            # Max messages in working memory
        self.llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)

    def build_context(self, current_query: str) -> dict:
        """Build the optimal context for the current query."""
        context = {
            # Tier 1: Working memory (full recent messages)
            "working_memory": self.working_memory[-self.max_working:],

            # Tier 2: Episodic memory (relevant past conversations)
            "episodic_memory": self._search_episodic(current_query),

            # Tier 3: Semantic memory (user facts/preferences)
            "semantic_memory": self._get_user_facts(),
        }
        return context

    def _search_episodic(self, query: str) -> List[str]:
        """Search past conversations for relevant context."""
        # In production: query a vector store of past conversations
        return []  # Placeholder

    def _get_user_facts(self) -> List[str]:
        """Retrieve persistent user facts and preferences."""
        # In production: query Redis or a user profile database
        return []  # Placeholder

    def process_message(self, user_input: str) -> str:
        """Process a message with tiered memory context."""
        context = self.build_context(user_input)

        # Build system prompt with memory context
        system_parts = ["You are a helpful AI assistant."]

        if context["semantic_memory"]:
            facts = "\n".join(f"- {f}" for f in context["semantic_memory"])
            system_parts.append(f"\nUser facts:\n{facts}")

        if context["episodic_memory"]:
            episodes = "\n".join(f"- {e}" for e in context["episodic_memory"])
            system_parts.append(f"\nRelevant past context:\n{episodes}")

        prompt = ChatPromptTemplate.from_messages([
            ("system", "\n".join(system_parts)),
            MessagesPlaceholder(variable_name="history"),
            ("human", "{input}"),
        ])

        chain = prompt | self.llm | StrOutputParser()
        response = chain.invoke({
            "history": context["working_memory"],
            "input": user_input,
        })

        # Update working memory
        self.working_memory.append({"role": "user", "content": user_input})
        self.working_memory.append({"role": "assistant", "content": response})

        return response
Architecture Insight: The three-tier memory model mirrors how human memory works. Working memory (like our short-term memory) holds the current conversation. Episodic memory (like autobiographical memory) stores past experiences retrievable by similarity. Semantic memory (like our knowledge store) holds facts and preferences that are always relevant.

3. Context Engineering

Context engineering is the discipline of assembling the optimal context for each LLM call. With context windows ranging from 8K to 1M+ tokens, the challenge is not fitting everything in — it is choosing what to include and how to structure it for maximum effectiveness.

3.1 Context Strategy Design

Every LLM call has a finite context window, and filling it intelligently is one of the most impactful optimizations in AI application design. A context assembler allocates token budgets across competing priorities — system prompt, conversation history, retrieved documents, and tool results — ensuring high-priority content always fits while lower-priority content is gracefully truncated. This priority-based approach prevents the common failure mode where verbose retrieval results crowd out essential conversation context.

# pip install tiktoken
from dataclasses import dataclass
from typing import List, Optional
import tiktoken

@dataclass
class ContextBudget:
    """Define token budgets for each context component."""
    total_limit: int = 128000        # Model context window
    system_prompt: int = 2000        # System instructions
    user_query: int = 1000           # Current user input
    conversation_history: int = 4000  # Recent conversation
    retrieved_documents: int = 8000   # RAG context
    tool_results: int = 2000         # Tool call outputs
    output_reserved: int = 4000      # Reserved for generation

    @property
    def available_for_context(self) -> int:
        """Calculate available tokens after fixed allocations."""
        fixed = (self.system_prompt + self.user_query +
                 self.output_reserved)
        return self.total_limit - fixed

class ContextAssembler:
    """Assemble optimal context within token budget."""

    def __init__(self, budget: ContextBudget):
        self.budget = budget
        self.encoder = tiktoken.encoding_for_model("gpt-4o")

    def count_tokens(self, text: str) -> int:
        return len(self.encoder.encode(text))

    def assemble(
        self,
        system_prompt: str,
        user_query: str,
        conversation_history: List[dict],
        retrieved_docs: List[str],
        tool_results: Optional[List[str]] = None,
    ) -> dict:
        """Assemble context with priority-based token allocation."""
        context = {"system": system_prompt, "query": user_query}
        remaining = self.budget.available_for_context

        # Priority 1: Recent conversation (most recent first)
        history_text = []
        history_budget = min(remaining, self.budget.conversation_history)
        used = 0
        for msg in reversed(conversation_history):
            msg_tokens = self.count_tokens(str(msg))
            if used + msg_tokens > history_budget:
                break
            history_text.insert(0, msg)
            used += msg_tokens
        context["history"] = history_text
        remaining -= used

        # Priority 2: Retrieved documents (most relevant first)
        doc_text = []
        doc_budget = min(remaining, self.budget.retrieved_documents)
        used = 0
        for doc in retrieved_docs:
            doc_tokens = self.count_tokens(doc)
            if used + doc_tokens > doc_budget:
                break
            doc_text.append(doc)
            used += doc_tokens
        context["documents"] = doc_text
        remaining -= used

        # Priority 3: Tool results
        if tool_results:
            tool_text = []
            tool_budget = min(remaining, self.budget.tool_results)
            used = 0
            for result in tool_results:
                result_tokens = self.count_tokens(result)
                if used + result_tokens > tool_budget:
                    break
                tool_text.append(result)
                used += result_tokens
            context["tool_results"] = tool_text

        return context

# Usage
budget = ContextBudget(total_limit=128000)
assembler = ContextAssembler(budget)

# Example conversation and RAG results (replace with your own data)
conversation = [
    {"role": "user", "content": "I need help with patent law research."},
    {"role": "assistant", "content": "I can help with that. What specific area?"},
]
rag_results = [
    "Document 1: Key patent infringement precedents include...",
    "Document 2: The Alice Corp v. CLS Bank case established...",
]

context = assembler.assemble(
    system_prompt="You are a legal research assistant...",
    user_query="What are the key precedents for patent infringement?",
    conversation_history=conversation,
    retrieved_docs=rag_results,
)
print(f"Context keys: {list(context.keys())}")
print(f"Documents included: {len(context.get('documents', []))}")
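The fixed-allocation arithmetic behind available_for_context can be checked by hand with the default budget numbers (plain Python, no tiktoken needed):

```python
# Default ContextBudget allocations, in tokens
total_limit = 128_000
system_prompt = 2_000
user_query = 1_000
output_reserved = 4_000   # held back for the model's generated output

# History, retrieved documents, and tool results compete for the remainder
available = total_limit - (system_prompt + user_query + output_reserved)
print(available)  # 121000
```

Note that the per-component budgets (history 4K, documents 8K, tools 2K) are caps within this remainder, not additional reservations, which is why the assembler re-checks `remaining` before each allocation.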

3.2 Intelligent Chunking for Context

Standard text splitting treats all chunks equally, but context-aware chunking enriches each chunk with metadata about its position in the original document. By prepending contextual headers (document title, section, chunk index) to each chunk, the retriever can return results that the LLM can situate within the broader document structure. This technique significantly improves answer quality for long documents where a chunk’s meaning depends on its surrounding context.

# pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

class ContextAwareChunker:
    """Chunk documents with context optimization in mind."""

    def __init__(self, target_chunk_tokens: int = 300):
        self.target = target_chunk_tokens
        self.splitter = RecursiveCharacterTextSplitter(
            chunk_size=target_chunk_tokens * 4,  # Approx chars per token
            chunk_overlap=50,
            separators=["\n\n", "\n", ". ", "; ", ", ", " "],
        )

    def chunk_with_context(self, text: str, metadata: dict) -> list:
        """Split text and enrich chunks with contextual headers."""
        raw_chunks = self.splitter.split_text(text)

        enriched = []
        for i, chunk in enumerate(raw_chunks):
            # Add a contextual header to each chunk
            header = (
                f"[Document: {metadata.get('title', 'Unknown')} | "
                f"Section: {metadata.get('section', 'N/A')} | "
                f"Chunk {i+1}/{len(raw_chunks)}]"
            )
            enriched.append(f"{header}\n{chunk}")

        return enriched

# Contextual headers help the LLM understand where each chunk comes from
chunker = ContextAwareChunker(target_chunk_tokens=300)

# Example document text (replace with your own content)
document_text = (
    "RAG systems benefit from intelligent chunking strategies. "
    "Smaller chunks improve retrieval precision while larger chunks "
    "provide better context for generation. The optimal chunk size "
    "depends on the embedding model and the nature of the documents."
)

chunks = chunker.chunk_with_context(
    document_text,
    {"title": "RAG Best Practices", "section": "Chunking Strategies"}
)
for chunk in chunks:
    print(chunk[:120], "...")

4. Re-Ranking

Vector similarity search is fast but imprecise. Re-ranking applies a more sophisticated model to reorder the initial retrieval results, dramatically improving precision. The pattern is: retrieve broadly (top-20), then re-rank precisely (top-5).

4.1 Cohere Rerank

Why langchain-classic? The ContextualCompressionRetriever, EmbeddingsFilter, and other retriever utilities were moved out of the main langchain package into langchain-classic starting with LangChain v0.3. This package contains legacy chains, agents, hub, memory classes, and retriever document compressors. Install it with pip install langchain-classic. See the full API reference for all available modules.
# pip install langchain-classic langchain-cohere langchain-chroma langchain-openai chromadb
import os
from langchain_classic.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_cohere import CohereRerank
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings
from langchain_core.documents import Document

# Create sample documents for demonstration
sample_docs = [
    Document(page_content="Semantic chunking splits text at natural topic boundaries using embeddings."),
    Document(page_content="Fixed-size chunking uses a set number of tokens per chunk with overlap."),
    Document(page_content="Recursive character splitting tries multiple separators in order."),
    Document(page_content="Vector databases store high-dimensional embeddings for similarity search."),
    Document(page_content="Sentence-window retrieval embeds single sentences but retrieves surrounding context."),
]

# Base retriever - retrieve broadly from in-memory vector store
vectorstore = Chroma.from_documents(
    documents=sample_docs,
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
)
base_retriever = vectorstore.as_retriever(
    search_kwargs={"k": 5}
)

# Cohere reranker - rerank to top 3
# Set your Cohere API key: export COHERE_API_KEY="..."
cohere_reranker = CohereRerank(
    model="rerank-english-v3.0",
    cohere_api_key=os.getenv("COHERE_API_KEY"),
    top_n=3,
)

# Combine: retrieve broadly, then rerank
reranking_retriever = ContextualCompressionRetriever(
    base_compressor=cohere_reranker,
    base_retriever=base_retriever,
)

# The reranker uses cross-attention between query and each document
# which is much more accurate than cosine similarity alone
results = reranking_retriever.invoke("What are the best chunking strategies for RAG?")
for doc in results:
    score = doc.metadata.get("relevance_score")
    print(f"Score: {score:.4f}" if score is not None else "Score: N/A")
    print(f"Content: {doc.page_content[:100]}...\n")

4.2 ColBERT Late Interaction

ColBERT uses a "late interaction" mechanism — it embeds query tokens and document tokens independently, then computes fine-grained token-level similarity. This provides nearly cross-encoder quality at bi-encoder speed:

# pip install pylate
# ColBERT re-ranking with PyLate (successor to RAGatouille)
from pylate import models, rank

# Load a ColBERT model — late interaction for token-level matching
colbert = models.ColBERT(model_name_or_path="lightonai/GTE-ModernColBERT-v1")

# Sample documents to rerank
documents = [
    "Chunking strategies for legal documents require preserving clause boundaries and cross-references between sections.",
    "Vector databases store embeddings for fast similarity search using approximate nearest neighbors.",
    "Legal text chunking should respect paragraph structure, numbered lists, and hierarchical section numbering.",
    "Prompt engineering techniques include few-shot learning, chain-of-thought, and role prompting.",
    "Document splitting for contracts should keep related clauses together and preserve definition references.",
]

# Encode query and documents separately (late interaction)
query = "What chunking strategies work best for legal documents?"
query_embeddings = colbert.encode([query], is_query=True)
doc_embeddings = colbert.encode([documents], is_query=False)

# Rerank using MaxSim token-level similarity scoring
reranked = rank.rerank(
    documents_ids=[list(range(len(documents)))],
    queries_embeddings=query_embeddings,
    documents_embeddings=doc_embeddings,
)

# ColBERT advantages:
# - Token-level matching catches partial relevance
# - Query and document tokens are encoded independently (late interaction)
# - MaxSim scoring is more nuanced than single-vector cosine
for result in reranked[0]:
    doc_idx = result["id"]
    print(f"Score: {result['score']:.4f}")
    print(f"Content: {documents[doc_idx][:100]}...\n")
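To make the MaxSim scoring concrete, here is a toy NumPy sketch of the late-interaction score: for each query token, take the best-matching document token, then sum across query tokens. Real ColBERT operates on learned token embeddings; the random vectors here exist only to illustrate the math:

```python
import numpy as np

def maxsim_score(query_tokens: np.ndarray, doc_tokens: np.ndarray) -> float:
    """Late-interaction MaxSim: for each query token, take the best
    doc-token cosine similarity, then sum across query tokens."""
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    sim = q @ d.T                        # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())  # best match per query token, summed

rng = np.random.default_rng(0)
query = rng.normal(size=(4, 8))      # 4 query tokens, 8-dim embeddings
doc_same = query.copy()              # identical tokens: perfect match for every query token
doc_other = rng.normal(size=(6, 8))  # unrelated tokens

print(round(maxsim_score(query, doc_same), 4))  # 4.0 (one perfect match per query token)
print(maxsim_score(query, doc_same) > maxsim_score(query, doc_other))  # True
```

Because query and document tokens are encoded independently, document embeddings can be pre-computed offline, which is exactly what gives ColBERT its bi-encoder speed with near cross-encoder quality.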

4.3 Cross-Encoder Re-Ranking

Cross-encoders are the most accurate re-ranking method because they process the query and document together as a single input, allowing full attention between all tokens. Unlike bi-encoders (which embed query and documents separately), cross-encoders capture fine-grained interactions — at the cost of being too slow for initial retrieval over millions of documents. The standard pattern is to use fast bi-encoder retrieval for the top-100, then cross-encoder re-ranking to find the best 5–10 results.

# pip install sentence-transformers langchain-core
from sentence_transformers import CrossEncoder
from langchain_core.documents import Document

# Cross-encoder: most accurate re-ranking (but slowest)
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-12-v2")

def rerank_with_cross_encoder(
    query: str,
    documents: list[Document],
    top_k: int = 5
) -> list[Document]:
    """Re-rank documents using a cross-encoder model."""
    # Create query-document pairs
    pairs = [(query, doc.page_content) for doc in documents]

    # Score all pairs (cross-encoder sees query + doc together)
    scores = cross_encoder.predict(pairs)

    # Sort by score (descending) and return top-k
    scored_docs = sorted(
        zip(documents, scores),
        key=lambda x: x[1],
        reverse=True
    )

    for doc, score in scored_docs[:top_k]:
        doc.metadata["rerank_score"] = float(score)

    return [doc for doc, _ in scored_docs[:top_k]]

# Sample documents (in production, these come from a vector store retrieval)
sample_docs = [
    Document(page_content="RAG optimization requires careful tuning of chunk size, overlap, and retrieval parameters."),
    Document(page_content="Vector databases store embeddings for fast similarity search using approximate nearest neighbors."),
    Document(page_content="Prompt engineering techniques include few-shot learning and chain-of-thought reasoning."),
    Document(page_content="Optimizing RAG pipelines involves re-ranking retrieved documents and filtering irrelevant results."),
    Document(page_content="Embedding models like text-embedding-3-small capture semantic meaning in dense vectors."),
    Document(page_content="Context window management is critical when building production RAG applications."),
]

# Rerank the sample documents
reranked = rerank_with_cross_encoder("RAG optimization", sample_docs, top_k=3)
for doc in reranked:
    print(f"Score: {doc.metadata['rerank_score']:.4f}")
    print(f"Content: {doc.page_content[:100]}...\n")

| Re-Ranker     | Speed               | Quality   | Cost             | Best For                               |
|---------------|---------------------|-----------|------------------|----------------------------------------|
| Cohere Rerank | Fast (API)          | Excellent | $1/1K queries    | Production, quick integration          |
| ColBERT v2    | Fast (pre-computed) | Very Good | Free (local GPU) | Self-hosted, high throughput           |
| Cross-Encoder | Slow                | Highest   | Free (local GPU) | Maximum accuracy, small candidate sets |
| LLM-based     | Slowest             | Highest   | $$ (API costs)   | Complex relevance criteria             |
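The last row of the table, LLM-based re-ranking, is simple enough to sketch: ask a model to grade each query-document pair and sort by the grade. The helpers below (`llm_rerank`, `openai_score`) are hypothetical names, not library functions; the scoring function is injected so any model can be swapped in:

```python
# pip install langchain-openai
from typing import Callable

def llm_rerank(
    query: str,
    documents: list[str],
    score_fn: Callable[[str, str], float],
    top_k: int = 3,
) -> list[tuple[str, float]]:
    """Re-rank documents by a model-assigned relevance score, highest first."""
    scored = [(doc, score_fn(query, doc)) for doc in documents]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]

def openai_score(query: str, document: str) -> float:
    """One LLM call per document: ask for a 0-10 relevance grade."""
    from langchain_openai import ChatOpenAI
    model = ChatOpenAI(model="gpt-4o-mini", temperature=0)
    reply = model.invoke(
        "On a scale of 0 to 10, how relevant is this document to the query? "
        f"Reply with only a number.\n\nQuery: {query}\n\nDocument: {document}"
    )
    try:
        return float(reply.content.strip())
    except ValueError:
        return 0.0  # Unparseable reply: treat as irrelevant

# Usage: llm_rerank("RAG optimization", docs, score_fn=openai_score, top_k=3)
```

Because each document costs one LLM call, this only makes sense for small candidate sets (the top 10-20 from a cheaper retriever) — exactly the "slowest, highest quality" trade-off the table implies.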

5. Prompt Compression

When your context exceeds the token budget, you need to compress it intelligently — keeping the essential information while discarding redundancy. This is particularly important for RAG systems where retrieved documents contain a mix of relevant and irrelevant content.

5.1 LLMLingua & LongLLMLingua

LLMLingua uses a small language model to identify and remove low-information tokens from prompts, achieving ~50% compression with minimal quality loss. This is particularly valuable for RAG pipelines where retrieved documents consume most of the context window. The technique preserves forced tokens (like question marks and key entities) while stripping filler words, redundant phrases, and boilerplate — effectively letting you fit twice as much context into the same token budget.

# pip install llmlingua accelerate
# LLMLingua: compresses prompts by identifying and removing
# low-information tokens while preserving meaning
from llmlingua import PromptCompressor

compressor = PromptCompressor(
    model_name="microsoft/llmlingua-2-bert-base-multilingual-cased-meetingbank",
    device_map="cpu",       # Use "cuda" if GPU available
    use_llmlingua2=True,
)

# Original prompt (2000 tokens)
original_prompt = """
Context: Vector databases are specialized database systems designed to
store, index, and query high-dimensional vector embeddings. These vectors
are numerical representations of data (text, images, audio) that capture
semantic meaning. The primary advantage of vector databases over traditional
databases is their ability to perform similarity search at scale...
[much more text]

Question: What are the key benefits of vector databases?
"""

# Compress the prompt (targets ~50% of the original tokens with minimal quality loss)
compressed = compressor.compress_prompt(
    original_prompt,
    rate=0.5,               # Target 50% compression
    force_tokens=["Question:", "Answer:"],  # Never remove these
)

print(f"Original: {compressed['origin_tokens']} tokens")
print(f"Compressed: {compressed['compressed_tokens']} tokens")
print(f"Ratio: {compressed['ratio']}")
print(f"Compressed prompt:\n{compressed['compressed_prompt']}")

5.2 Contextual Compression Retriever

LangChain’s ContextualCompressionRetriever wraps any base retriever and applies compressors to the retrieved documents before returning them. Compressors can extract only the relevant sentences (LLMChainExtractor), filter out irrelevant documents entirely (LLMChainFilter), use embedding similarity for fast filtering (EmbeddingsFilter), or combine multiple stages in a DocumentCompressorPipeline. This dramatically reduces the token cost of each retrieval while improving answer precision.

# pip install langchain-classic langchain-chroma langchain-openai chromadb
from langchain_classic.retrievers.contextual_compression import ContextualCompressionRetriever
from langchain_classic.retrievers.document_compressors import (
    LLMChainExtractor,
    LLMChainFilter,
    EmbeddingsFilter,
    DocumentCompressorPipeline,
)
from langchain_chroma import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.documents import Document

model = ChatOpenAI(model="gpt-4o-mini", temperature=0)

# Create sample vector store with documents
sample_docs = [
    Document(page_content="Buffer memory stores the full conversation history as a list of messages. Simple but grows unbounded."),
    Document(page_content="Summary memory uses an LLM to progressively summarize older messages, keeping context compact."),
    Document(page_content="Window memory keeps only the last K exchanges, discarding older messages automatically."),
    Document(page_content="Vector memory stores conversations as embeddings for semantic retrieval of relevant past context."),
    Document(page_content="Entity memory tracks named entities and their attributes across the conversation."),
]

vectorstore = Chroma.from_documents(
    documents=sample_docs,
    embedding=OpenAIEmbeddings(model="text-embedding-3-small"),
)
base_retriever = vectorstore.as_retriever(search_kwargs={"k": 5})

# Method 1: LLM extracts only relevant portions
llm_extractor = LLMChainExtractor.from_llm(model)
compression_retriever = ContextualCompressionRetriever(
    base_compressor=llm_extractor,
    base_retriever=base_retriever,
)
# Returns only the sentences/paragraphs relevant to the query

# Method 2: LLM filters out irrelevant documents entirely
llm_filter = LLMChainFilter.from_llm(model)
filtering_retriever = ContextualCompressionRetriever(
    base_compressor=llm_filter,
    base_retriever=base_retriever,
)
# Removes documents that are not relevant (binary yes/no)

# Method 3: Embedding-based relevance filter (fast, no LLM cost)
embeddings_filter = EmbeddingsFilter(
    embeddings=OpenAIEmbeddings(model="text-embedding-3-small"),
    similarity_threshold=0.3,   # Tune per dataset; 0.75 is too aggressive for short docs
)

# Method 4: Pipeline - combine multiple compression strategies
compression_pipeline = DocumentCompressorPipeline(
    transformers=[
        embeddings_filter,     # First: fast embedding filter
        llm_extractor,         # Then: LLM extracts relevant parts
    ]
)
pipeline_retriever = ContextualCompressionRetriever(
    base_compressor=compression_pipeline,
    base_retriever=base_retriever,
)

# The pipeline first removes low-similarity docs (cheap),
# then extracts relevant portions from survivors (expensive but fewer docs)
results = pipeline_retriever.invoke("Best memory strategies for chatbots")
for doc in results:
    print(f"Content: {doc.page_content[:100]}")
Compression Strategy: Apply compression in layers: (1) embedding-based filtering to remove clearly irrelevant docs (fast, free), (2) re-ranking to reorder by relevance (fast, cheap), (3) LLM extraction to pull relevant sentences from top docs (slower, but most precise). This layered approach minimizes cost while maximizing quality.

6. Context Window Management

Different models have different context windows (8K to 1M+ tokens), and using them effectively requires deliberate token budgeting and overflow strategies.

6.1 Token Budgeting

Token budgeting is the practice of explicitly planning how many tokens each component of your prompt consumes, ensuring you never exceed the model’s context window while maximizing the useful content in every call. A TokenBudgetManager tracks allocations for system prompts, conversation history, retrieved context, and reserves space for the model’s response. It also estimates API costs per call — critical for production applications where token usage directly maps to infrastructure spend.

import tiktoken
from dataclasses import dataclass

@dataclass
class ModelConfig:
    name: str
    context_window: int
    cost_per_1k_input: float
    cost_per_1k_output: float

MODELS = {
    "gpt-4o": ModelConfig("gpt-4o", 128000, 0.0025, 0.01),
    "gpt-4o-mini": ModelConfig("gpt-4o-mini", 128000, 0.00015, 0.0006),
    "claude-3-5-sonnet": ModelConfig("claude-3-5-sonnet", 200000, 0.003, 0.015),
    "claude-3-haiku": ModelConfig("claude-3-haiku", 200000, 0.00025, 0.00125),
}

class TokenBudgetManager:
    """Manage token allocation across context components."""

    def __init__(self, model_name: str = "gpt-4o"):
        self.model = MODELS[model_name]
        try:
            self.encoder = tiktoken.encoding_for_model(self.model.name)
        except KeyError:
            # Claude models are not in tiktoken; approximate with a GPT tokenizer
            self.encoder = tiktoken.get_encoding("o200k_base")

    def plan_budget(
        self,
        system_prompt: str,
        user_query: str,
        max_output_tokens: int = 4096,
    ) -> dict:
        """Plan token budget for a request."""
        system_tokens = len(self.encoder.encode(system_prompt))
        query_tokens = len(self.encoder.encode(user_query))
        fixed_overhead = 50  # Message framing tokens

        used = system_tokens + query_tokens + fixed_overhead + max_output_tokens
        available = self.model.context_window - used

        # Allocate available tokens by priority
        budget = {
            "system_prompt": system_tokens,
            "user_query": query_tokens,
            "output_reserved": max_output_tokens,
            "available_for_context": available,
            "recommended_allocation": {
                "conversation_history": int(available * 0.25),
                "retrieved_documents": int(available * 0.60),
                "tool_results": int(available * 0.10),
                "buffer": int(available * 0.05),
            },
            "estimated_cost": self._estimate_cost(
                system_tokens + query_tokens + int(available * 0.85),
                max_output_tokens
            ),
        }
        return budget

    def _estimate_cost(self, input_tokens: int, output_tokens: int) -> float:
        input_cost = (input_tokens / 1000) * self.model.cost_per_1k_input
        output_cost = (output_tokens / 1000) * self.model.cost_per_1k_output
        return round(input_cost + output_cost, 6)

# Usage
manager = TokenBudgetManager("gpt-4o")
budget = manager.plan_budget(
    system_prompt="You are a legal research assistant...",
    user_query="What are the key precedents for software patent cases?",
)
print(f"Available for context: {budget['available_for_context']:,} tokens")
print(f"Recommended docs budget: {budget['recommended_allocation']['retrieved_documents']:,}")
print(f"Estimated cost: ${budget['estimated_cost']}")

6.2 Sliding Window Strategies

As conversations grow, they eventually exceed the context window. A sliding window strategy keeps the most recent messages while discarding older ones, maintaining conversational flow within the token budget. The key design decision is what to preserve: the system message should always stay (it defines agent behavior), and recent messages are most relevant. The implementation below uses tiktoken for precise token counting rather than rough character estimates, ensuring you never accidentally truncate mid-message.

from typing import List, Dict
import tiktoken

class SlidingWindowManager:
    """Manage conversation history with intelligent truncation."""

    def __init__(self, max_tokens: int = 4000, model: str = "gpt-4o"):
        self.max_tokens = max_tokens
        self.encoder = tiktoken.encoding_for_model(model)
        self.messages: List[Dict] = []

    def add_message(self, role: str, content: str):
        self.messages.append({"role": role, "content": content})

    def get_windowed_history(self) -> List[Dict]:
        """Return messages that fit within the token budget."""
        result = []
        token_count = 0

        # Pin the system message (if present) so truncation never drops it
        system_msg = None
        body = self.messages
        if body and body[0]["role"] == "system":
            system_msg = body[0]
            token_count = len(self.encoder.encode(system_msg["content"])) + 4
            body = body[1:]

        # Start from the most recent message and work backwards
        for msg in reversed(body):
            msg_tokens = len(self.encoder.encode(msg["content"])) + 4
            if token_count + msg_tokens > self.max_tokens:
                break
            result.insert(0, msg)
            token_count += msg_tokens

        return ([system_msg] if system_msg else []) + result

    def get_smart_history(self, summarize_fn=None) -> List[Dict]:
        """
        Smart windowing: keep recent messages verbatim,
        summarize older messages.
        """
        recent_budget = int(self.max_tokens * 0.7)
        summary_budget = int(self.max_tokens * 0.3)

        # Get recent messages that fit in 70% of budget
        recent = []
        token_count = 0
        for msg in reversed(self.messages):
            msg_tokens = len(self.encoder.encode(msg["content"])) + 4
            if token_count + msg_tokens > recent_budget:
                break
            recent.insert(0, msg)
            token_count += msg_tokens

        # Summarize older messages into 30% of budget
        older = self.messages[:len(self.messages) - len(recent)]
        if older and summarize_fn:
            summary = summarize_fn(older, max_tokens=summary_budget)
            return [{"role": "system", "content": f"Previous conversation summary: {summary}"}] + recent

        return recent

# Usage
window = SlidingWindowManager(max_tokens=4000)
window.add_message("user", "I'm building a RAG system for medical records")
window.add_message("assistant", "Medical RAG requires extra care with PII...")
window.add_message("user", "What embedding model should I use?")
window.add_message("assistant", "For medical text, consider PubMedBERT embeddings...")
window.add_message("user", "How should I handle HIPAA compliance?")
window.add_message("assistant", "HIPAA requires encryption at rest and in transit...")
# Add more messages as needed; older messages are automatically truncated
history = window.get_windowed_history()
print(f"Window contains {len(history)} messages")
Common Pitfall: Do not stuff the entire context window. LLMs perform worse when context is very long — important information in the middle of a long context is often ignored (the "lost in the middle" problem). Aim to use 50-70% of the context window and prioritize placing the most important information at the beginning and end of the context.
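One practical response to the lost-in-the-middle problem is to reorder retrieved documents so the strongest matches sit at the edges of the context (LangChain ships this idea as LongContextReorder; below is a minimal self-contained sketch of the interleaving trick):

```python
def reorder_for_long_context(docs: list[str]) -> list[str]:
    """Given docs sorted most-relevant-first, interleave them so the best
    documents land at the beginning and end, and the weakest in the middle."""
    front: list[str] = []
    back: list[str] = []
    for i, doc in enumerate(docs):
        if i % 2 == 0:
            front.append(doc)     # ranks 1, 3, 5, ... fill from the start
        else:
            back.insert(0, doc)   # ranks 2, 4, 6, ... fill from the end
    return front + back

print(reorder_for_long_context(["best", "2nd", "3rd", "4th", "worst"]))
# ['best', '3rd', 'worst', '4th', '2nd']
```

After reordering, the two most relevant documents occupy the first and last positions — exactly where the model attends most reliably — while the weakest result sits in the middle where it matters least.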

Exercises & Self-Assessment

Hands-On Exercises

  1. Memory Comparison: Build a chatbot with buffer memory and another with summary memory. Have a 50-turn conversation with each. Compare token usage, response quality, and ability to recall early conversation details.
  2. Long-Term Memory: Implement a Redis-backed long-term memory system. Build a chatbot that remembers user preferences across sessions. Test with 5 separate sessions and verify it recalls facts from session 1 in session 5.
  3. Re-Ranking Pipeline: Implement a RAG pipeline with and without Cohere re-ranking. Test with 20 queries. Measure precision@5 improvement from re-ranking. Calculate the cost per query for the re-ranking step.
  4. Context Budget Manager: Build a context assembler that takes a query, conversation history, and RAG results, and assembles optimal context within a 4000-token budget. Test with various input sizes and verify it never exceeds the budget.
  5. Compression Pipeline: Implement a two-stage compression pipeline (embedding filter + LLM extraction). Measure the token savings and quality impact compared to using full retrieved documents.

Critical Thinking Questions

  1. Why does the "lost in the middle" problem occur? How does this influence where you place important context (beginning vs. middle vs. end)?
  2. Compare the three-tier memory architecture (working/episodic/semantic) with how human memory works. Where does the analogy break down?
  3. Cross-encoder re-ranking is the most accurate but slowest approach. Design a system that gets cross-encoder quality at near-embedding speed. What approximations would you make?
  4. A user asks: "Should I use a 1M context window model and skip RAG entirely?" Write a detailed analysis of the trade-offs between very large context windows and RAG-based retrieval.
  5. You are building a customer support bot that handles 10,000 conversations per day. Design a memory system that balances personalization, cost, and latency. What would you store, where, and for how long?

Memory System Design Document Generator

Design and document an AI memory system architecture. Download as Word, Excel, PDF, or PowerPoint.


Conclusion & Next Steps

You now understand how to give AI applications the ability to remember, how to engineer context for maximum effectiveness, and how to manage the finite resource of context windows. Here are the key takeaways from Part 6:

  • Memory types serve different needs — buffer for accuracy, summary for efficiency, window for predictability, vector for relevance, entity for structured knowledge
  • Long-term memory with persistent storage (Redis, PostgreSQL) enables cross-session personalization and relationship building
  • The three-tier memory architecture (working/episodic/semantic) provides a comprehensive framework for production memory systems
  • Context engineering is about assembling the optimal context within token budgets — prioritize, compress, and structure information deliberately
  • Re-ranking (Cohere, ColBERT, cross-encoders) dramatically improves retrieval precision by reordering initial results with more sophisticated models
  • Prompt compression (LLMLingua, contextual compression) saves tokens while preserving essential information
  • Context window management requires deliberate token budgeting, sliding windows, and awareness of the "lost in the middle" problem

Next in the Series

In Part 7: Agents — Core of Modern AI Apps, we explore the paradigm shift from chains to agents — autonomous systems that can reason, plan, use tools, and execute multi-step tasks. You will learn ReAct agents, tool-calling agents, planner-executor patterns, and how to combine RAG and memory with agent capabilities.
