
AI Application Development Mastery Part 5: Retrieval-Augmented Generation (RAG)

April 1, 2026 Wasil Zafar 44 min read

Build production-grade RAG pipelines from the ground up. Master embeddings (OpenAI and sentence-transformers), compare six major vector databases, implement document loading and text splitting strategies, explore retriever patterns (similarity, MMR, multi-query), and learn advanced techniques including HyDE, RAG fusion, and parent document retrieval.

Table of Contents

  1. Embeddings
  2. Vector Databases
  3. Document Loading & Processing
  4. Retriever Patterns
  5. Advanced RAG Techniques
  6. Exercises & Self-Assessment
  7. RAG Pipeline Generator
  8. Conclusion & Next Steps

Introduction: Why RAG Changes Everything

Series Overview: This is Part 5 of our 20-part AI Application Development Mastery series. We will take you from foundational understanding through prompt engineering, LangChain, RAG systems, agents, LangGraph, multi-agent architectures, production deployment, and building real-world AI applications.

AI Application Development Mastery

Your 20-step learning path • Currently on Step 5
  1. Foundations & Evolution of AI Apps: Pre-LLM era, transformers, LLM revolution
  2. LLM Fundamentals for Developers: Tokens, context windows, sampling, API patterns
  3. Prompt Engineering Mastery: Zero/few-shot, CoT, ReAct, structured outputs
  4. LangChain Core Concepts: Chains, prompts, LLMs, tools, LCEL
  5. Retrieval-Augmented Generation (RAG): Embeddings, vector DBs, retrievers, RAG pipelines (You Are Here)
  6. Memory & Context Engineering: Buffer/summary/vector memory, chunking, re-ranking
  7. Agents — Core of Modern AI Apps: ReAct, tool-calling, planner-executor agents
  8. LangGraph — Stateful Agent Workflows: Nodes, edges, state, graph execution, cycles
  9. Deep Agents & Autonomous Systems: Multi-step reasoning, self-reflection, planning
  10. Multi-Agent Systems: Supervisor, swarm, debate, role-based collaboration
  11. AI Application Design Patterns: RAG, chat+memory, workflow automation, agent loops
  12. Ecosystem & Frameworks: LlamaIndex, Haystack, HuggingFace, vLLM
  13. MCP Foundations & Architecture: Protocol design, Host/Client/Server, primitives, security
  14. MCP in Production: Building servers, integrations, scaling, agent systems
  15. Evaluation & LLMOps: Prompt eval, tracing, LangSmith, experiment tracking
  16. Production AI Systems: APIs, queues, caching, streaming, scaling
  17. Safety, Guardrails & Reliability: Input filtering, hallucination mitigation, prompt injection
  18. Advanced Topics: Fine-tuning, tool learning, hybrid LLM+symbolic
  19. Building Real AI Applications: Chatbot, document QA, coding assistant, full-stack
  20. Future of AI Applications: Autonomous agents, self-improving, multi-modal, AI OS

Large Language Models are powerful, but they have a fundamental limitation: they can only work with information from their training data. RAG (Retrieval-Augmented Generation) solves this by giving LLMs access to external knowledge at query time — your documents, databases, APIs, and any other data source. Instead of hoping the model "knows" the answer, you retrieve the relevant information and augment the prompt before generating a response.

RAG is not just a technique — it is the dominant architecture for production AI applications. From customer support bots that answer questions about your product to legal research tools that cite specific case law, from enterprise search that spans millions of documents to coding assistants that understand your codebase — RAG powers them all.

Key Insight: RAG is preferred over fine-tuning for most use cases because it is cheaper, faster to iterate, provides source attribution, and keeps data fresh without retraining. Fine-tuning changes how the model behaves; RAG changes what the model knows.
| RAG Component | What You Will Learn |
| --- | --- |
| Embeddings | Convert text to vectors using OpenAI and open-source models |
| Vector Databases | Compare and implement FAISS, Pinecone, Weaviate, Chroma, pgvector, Qdrant |
| Document Loaders | Ingest PDFs, web pages, databases, APIs, and structured data |
| Text Splitting | Chunk documents intelligently for optimal retrieval |
| Retriever Patterns | Similarity search, MMR, multi-query, and ensemble retrievers |
| Advanced RAG | HyDE, RAG fusion, parent document retrieval, and more |

1. Embeddings

Embeddings are the foundation of RAG. They transform text into dense numerical vectors that capture semantic meaning, allowing us to find similar content through mathematical distance calculations rather than keyword matching.

1.1 OpenAI Embeddings

OpenAI's embedding models convert text into dense numerical vectors that capture semantic meaning. The text-embedding-3-small model offers an excellent balance of quality and cost, producing 1536-dimensional vectors suitable for most retrieval tasks. These embeddings enable semantic similarity comparisons — texts with similar meanings produce vectors that are close together in the embedding space, even when they use completely different words.

# pip install langchain langchain-openai langchain-community chromadb faiss-cpu
# pip install langchain-pinecone langchain-qdrant langchain-postgres
# pip install langchain-text-splitters sentence-transformers numpy
import os
from langchain_openai import OpenAIEmbeddings
import numpy as np

# Set your API key: export OPENAI_API_KEY="sk-..."
assert os.getenv("OPENAI_API_KEY"), "OPENAI_API_KEY must be set"

# OpenAI's text-embedding-3-small (recommended for most use cases)
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",  # 1536 dimensions
    # model="text-embedding-3-large",  # 3072 dimensions (higher quality)
)

# Embed a single text
vector = embeddings.embed_query("What is retrieval-augmented generation?")
print(f"Dimensions: {len(vector)}")  # 1536
print(f"First 5 values: {vector[:5]}")

# Embed multiple documents (batched for efficiency)
documents = [
    "RAG retrieves relevant documents before generating responses.",
    "Vector databases store embeddings for fast similarity search.",
    "Chunking splits documents into smaller pieces for indexing.",
    "Embeddings capture semantic meaning as numerical vectors.",
]
doc_vectors = embeddings.embed_documents(documents)
print(f"Embedded {len(doc_vectors)} documents")

# Calculate cosine similarity between query and documents
def cosine_similarity(a, b):
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

query_vector = embeddings.embed_query("How does RAG work?")
for i, doc_vec in enumerate(doc_vectors):
    similarity = cosine_similarity(query_vector, doc_vec)
    print(f"Doc {i}: {similarity:.4f} - {documents[i][:50]}...")

1.2 Sentence Transformers (Open Source)

For applications where API costs, data privacy, or offline operation matter, open-source embedding models provide a compelling alternative. The BAAI/bge-large-en-v1.5 model from HuggingFace achieves near-commercial quality while running entirely on your own hardware. These models automatically leverage GPU acceleration when available, making them practical for both development and production workloads without any external API dependencies.

# pip install langchain-huggingface sentence-transformers
import torch
from langchain_huggingface import HuggingFaceEmbeddings

# Auto-detect GPU: uses CUDA if available, otherwise falls back to CPU
device = "cuda" if torch.cuda.is_available() else "cpu"

# Open-source alternatives - no API costs, runs locally
embeddings_local = HuggingFaceEmbeddings(
    model_name="BAAI/bge-large-en-v1.5",  # Top-performing open model
    model_kwargs={"device": device},        # Use GPU if available
    encode_kwargs={"normalize_embeddings": True},  # For cosine similarity
)

# Other excellent open-source embedding models:
# "sentence-transformers/all-MiniLM-L6-v2"  - Fast, 384 dims
# "BAAI/bge-small-en-v1.5"                  - Small, 384 dims
# "BAAI/bge-large-en-v1.5"                  - Best quality, 1024 dims
# "intfloat/e5-large-v2"                    - Strong multilingual
# "nomic-ai/nomic-embed-text-v1.5"          - 768 dims, Matryoshka

vector = embeddings_local.embed_query("RAG is essential for production AI apps")
print(f"Local embedding dimensions: {len(vector)}")

1.3 Choosing the Right Embedding Model

| Model | Dimensions | Speed | Quality | Cost | Best For |
| --- | --- | --- | --- | --- | --- |
| text-embedding-3-small | 1536 | Fast (API) | Very Good | $0.02/1M tokens | General purpose, low cost |
| text-embedding-3-large | 3072 | Fast (API) | Excellent | $0.13/1M tokens | High-accuracy retrieval |
| bge-large-en-v1.5 | 1024 | Medium | Excellent | Free (local) | On-premise, privacy-critical |
| all-MiniLM-L6-v2 | 384 | Very Fast | Good | Free (local) | Low-latency, resource-constrained |
| nomic-embed-text-v1.5 | 768 | Fast | Very Good | Free (local) | Long documents (8192 tokens) |
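Dimension counts are not always fixed: OpenAI's text-embedding-3 models accept a dimensions parameter, and Matryoshka-trained models like nomic-embed-text-v1.5 are designed so a truncated prefix of the vector remains usable. Both are commonly described as truncate-and-renormalize, which can be sketched offline with a toy vector (no API call; the 8-dim vector below is a stand-in for a real embedding):

```python
import numpy as np

def truncate_embedding(vec, dims):
    """Keep the first `dims` components and renormalize to unit length,
    the operation Matryoshka-style dimension reduction is based on."""
    prefix = np.asarray(vec[:dims], dtype=float)
    return prefix / np.linalg.norm(prefix)

# Toy 8-dim "embedding" standing in for a real model output
full = np.array([0.4, 0.3, 0.2, 0.1, 0.05, 0.04, 0.02, 0.01])
full = full / np.linalg.norm(full)

small = truncate_embedding(full, 4)
print(len(small))                              # 4
print(round(float(np.linalg.norm(small)), 6))  # 1.0
```

Shorter vectors cut storage and search cost roughly linearly, at some loss in retrieval quality, which is why the Pinecone example below requests dimensions=512.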

2. Vector Databases

Vector databases are purpose-built for storing, indexing, and querying high-dimensional vectors. They are the backbone of every RAG system, enabling millisecond-latency similarity search across millions or billions of embeddings.
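To ground what these systems actually accelerate, here is the brute-force version of similarity search in plain numpy (random unit vectors, purely illustrative): exact top-k by exhaustive scan, which indexes like HNSW and IVF approximate in sub-linear time.

```python
import numpy as np

rng = np.random.default_rng(42)

# 1,000 stored "document" embeddings, 128 dims, unit-normalized
docs = rng.normal(size=(1000, 128))
docs /= np.linalg.norm(docs, axis=1, keepdims=True)

query = rng.normal(size=128)
query /= np.linalg.norm(query)

# For unit vectors, cosine similarity reduces to a dot product
scores = docs @ query

# Exhaustive top-k: exact, but O(n * d) per query — this is the
# baseline that HNSW/IVF trade a little recall to beat at scale
k = 5
top_k = np.argsort(scores)[::-1][:k]
print(top_k, scores[top_k])
```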

2.1 Comprehensive Vector Database Comparison

| Feature | FAISS | Pinecone | Weaviate | Chroma | pgvector | Qdrant |
| --- | --- | --- | --- | --- | --- | --- |
| Type | Library | Managed SaaS | Self-hosted / Cloud | Embedded / Server | PostgreSQL extension | Self-hosted / Cloud |
| Hosting | In-process | Fully managed | Docker / Weaviate Cloud | In-process / Docker | Any PostgreSQL host | Docker / Qdrant Cloud |
| Persistence | Manual save/load | Automatic | Automatic | Automatic (SQLite) | PostgreSQL storage | Automatic (RocksDB) |
| Max Vectors | Billions (RAM) | Billions | Billions | Millions | Millions | Billions |
| Index Types | Flat, IVF, HNSW, PQ | Proprietary (auto) | HNSW | HNSW | IVFFlat, HNSW | HNSW |
| Metadata Filtering | No (manual) | Yes (rich filters) | Yes (GraphQL-style) | Yes (where clauses) | Yes (SQL WHERE) | Yes (payload filters) |
| Hybrid Search | No | Yes (sparse+dense) | Yes (BM25+vector) | No | Yes (with tsvector) | Yes (sparse+dense) |
| Multi-tenancy | Manual | Namespaces | Built-in tenants | Collections | Schemas/RLS | Collections + payload |
| Language | C++ (Python bindings) | Python, JS, Go, Java | Go (all client SDKs) | Python (Rust core) | C (PostgreSQL) | Rust (all client SDKs) |
| Pricing | Free (open source) | From $70/month | Free (self-host) / SaaS | Free (open source) | Free (with PostgreSQL) | Free (self-host) / SaaS |
| Best For | Prototyping, research, maximum speed | Production SaaS, zero-ops | Knowledge graphs + vectors | Local dev, prototyping | Existing PostgreSQL stacks | Production self-hosted, filtering |

2.2 Implementation Examples

Each vector database has distinct strengths: FAISS excels at local prototyping with in-memory speed, Chroma provides a developer-friendly experience with built-in persistence, and Pinecone offers fully managed cloud infrastructure for production scale. The following examples demonstrate the core workflow for each — creating a vector store from documents, persisting the index, and running similarity searches.

FAISS (Local, In-Memory)

# pip install faiss-cpu langchain-community
from langchain_openai import OpenAIEmbeddings
from langchain_community.vectorstores import FAISS
from langchain_core.documents import Document

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

docs = [
    Document(page_content="RAG combines retrieval with generation for accurate AI responses.",
             metadata={"source": "rag-guide.pdf", "page": 1}),
    Document(page_content="Vector databases store embeddings for millisecond similarity search.",
             metadata={"source": "vectordb-intro.pdf", "page": 5}),
    Document(page_content="Text chunking strategies significantly impact RAG quality.",
             metadata={"source": "rag-guide.pdf", "page": 12}),
    Document(page_content="HNSW indexing provides logarithmic search time complexity.",
             metadata={"source": "algorithms.pdf", "page": 34}),
]

faiss_store = FAISS.from_documents(docs, embeddings)
faiss_store.save_local("./faiss_index")  # Persist to disk
# faiss_store = FAISS.load_local("./faiss_index", embeddings,
#     allow_dangerous_deserialization=True)

results = faiss_store.similarity_search("How does RAG work?", k=3)
for doc in results:
    print(f"[{doc.metadata['source']}] {doc.page_content[:80]}...")

Chroma (Local, Persistent)

# pip install langchain-chroma chromadb
from langchain_openai import OpenAIEmbeddings
from langchain_chroma import Chroma
from langchain_core.documents import Document

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

docs = [
    Document(page_content="RAG combines retrieval with generation for accurate AI responses.",
             metadata={"source": "rag-guide.pdf", "page": 1}),
    Document(page_content="Vector databases store embeddings for millisecond similarity search.",
             metadata={"source": "vectordb-intro.pdf", "page": 5}),
    Document(page_content="Text chunking strategies significantly impact RAG quality.",
             metadata={"source": "rag-guide.pdf", "page": 12}),
    Document(page_content="HNSW indexing provides logarithmic search time complexity.",
             metadata={"source": "algorithms.pdf", "page": 34}),
]

chroma_store = Chroma.from_documents(
    docs, embeddings,
    collection_name="rag_docs",
    persist_directory="./chroma_db",  # Persisted automatically to disk
)

results = chroma_store.similarity_search("vector search", k=3)
for doc in results:
    print(f"[{doc.metadata['source']}] {doc.page_content[:80]}...")

Pinecone (Managed Cloud)

# pip install langchain-pinecone pinecone
import os
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_core.documents import Document
from pinecone import Pinecone, ServerlessSpec

# Use 512 dimensions to match the Pinecone index configuration
embeddings = OpenAIEmbeddings(model="text-embedding-3-small", dimensions=512)

docs = [
    Document(page_content="RAG combines retrieval with generation for accurate AI responses.",
             metadata={"source": "rag-guide.pdf", "page": 1}),
    Document(page_content="Vector databases store embeddings for millisecond similarity search.",
             metadata={"source": "vectordb-intro.pdf", "page": 5}),
]

# Requires: export PINECONE_API_KEY="pc-..."
pc = Pinecone(api_key=os.getenv("PINECONE_API_KEY"))

# Create index (only needed once — skip if already exists)
# pc.create_index(
#     name="rag-index",
#     dimension=512,
#     metric="cosine",
#     spec=ServerlessSpec(cloud="aws", region="us-east-1"),
# )

pinecone_store = PineconeVectorStore.from_documents(
    docs, embeddings, index_name="rag-index"
)
results = pinecone_store.similarity_search("RAG pipelines", k=2)
for doc in results:
    print(f"[{doc.metadata['source']}] {doc.page_content[:80]}...")

Qdrant (Self-Hosted or Cloud)

# pip install langchain-qdrant qdrant-client
# Requires: docker run -p 6333:6333 qdrant/qdrant
from langchain_openai import OpenAIEmbeddings
from langchain_qdrant import QdrantVectorStore
from langchain_core.documents import Document

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

docs = [
    Document(page_content="RAG combines retrieval with generation for accurate AI responses.",
             metadata={"source": "rag-guide.pdf", "page": 1}),
    Document(page_content="Text chunking strategies significantly impact RAG quality.",
             metadata={"source": "rag-guide.pdf", "page": 12}),
]

qdrant_store = QdrantVectorStore.from_documents(
    docs, embeddings,
    url="http://localhost:6333",
    collection_name="rag_docs",
)

results = qdrant_store.similarity_search("RAG techniques", k=2)
for doc in results:
    print(f"[{doc.metadata['source']}] {doc.page_content[:80]}...")

pgvector (PostgreSQL Extension)

# pip install langchain-postgres psycopg[binary]
# Requires: PostgreSQL with pgvector extension enabled
from langchain_openai import OpenAIEmbeddings
from langchain_postgres import PGVector
from langchain_core.documents import Document

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

docs = [
    Document(page_content="RAG combines retrieval with generation for accurate AI responses.",
             metadata={"source": "rag-guide.pdf", "page": 1}),
    Document(page_content="HNSW indexing provides logarithmic search time complexity.",
             metadata={"source": "algorithms.pdf", "page": 34}),
]

pgvector_store = PGVector.from_documents(
    docs, embeddings,
    connection="postgresql+psycopg://user:pass@localhost:5432/vectordb",
    collection_name="rag_docs",
)

results = pgvector_store.similarity_search("indexing algorithms", k=2)
for doc in results:
    print(f"[{doc.metadata['source']}] {doc.page_content[:80]}...")

Decision Guide: Start with Chroma for prototyping (zero config). Move to Qdrant or Weaviate for production self-hosting. Use Pinecone if you want fully managed infrastructure. Use pgvector if you already run PostgreSQL. Use FAISS for research or when you need maximum raw speed.

3. Document Loading & Processing

Before documents can be embedded and stored in a vector database, they need to be loaded from their source format and split into appropriately-sized chunks. This ingestion pipeline is critical — poor chunking leads to poor retrieval, regardless of how good your embedding model or vector database is.

RAG Document Ingestion Pipeline
Documents → Document Loader → Text Splitter → Embedding Model → Vector Store → Retriever → LLM Generation

3.1 Document Loaders

Before text can be embedded and retrieved, it must be loaded from its source format into LangChain's Document objects. LangChain provides loaders for virtually every data source — PDFs, web pages, CSVs, Markdown files, JSON with jq extraction, and even Git repositories. Each loader handles format-specific parsing (OCR for scanned PDFs, HTML stripping for web pages, schema extraction for JSON) and attaches source metadata that flows through the entire RAG pipeline.

# pip install pypdf unstructured jq gitpython beautifulsoup4
from langchain_community.document_loaders import (
    PyPDFLoader,
    TextLoader,
    CSVLoader,
    WebBaseLoader,
    DirectoryLoader,
    UnstructuredMarkdownLoader,
    JSONLoader,
    GitLoader,
)

# PDF documents
pdf_loader = PyPDFLoader("./docs/rag-paper.pdf")
pdf_docs = pdf_loader.load()  # One Document per page

# Web pages
web_loader = WebBaseLoader([
    "https://docs.langchain.com/docs/get-started/introduction",
    "https://python.langchain.com/docs/concepts/",
])
web_docs = web_loader.load()

# CSV files (each row becomes a Document)
csv_loader = CSVLoader(
    "./data/products.csv",
    csv_args={"delimiter": ","},
    source_column="product_name",
)
csv_docs = csv_loader.load()

# Entire directories (recursive)
dir_loader = DirectoryLoader(
    "./docs/",
    glob="**/*.md",
    loader_cls=UnstructuredMarkdownLoader,
    show_progress=True,
)
all_docs = dir_loader.load()

# JSON with jq-style extraction
json_loader = JSONLoader(
    "./data/api-responses.json",
    jq_schema=".results[].content",
    text_content=False,
)
json_docs = json_loader.load()

# Git repositories
git_loader = GitLoader(
    clone_url="https://github.com/langchain-ai/langchain",
    repo_path="./repos/langchain",
    branch="master",
    file_filter=lambda f: f.endswith(".py"),
)
code_docs = git_loader.load()

print(f"Loaded {len(all_docs)} documents from directory")

3.2 Text Splitting Strategies

Raw documents are rarely the right size for embedding — they need to be split into chunks that balance semantic coherence with retrieval precision. LangChain offers six splitting strategies, each suited to different content types: RecursiveCharacterTextSplitter (the default, splits on natural boundaries), TokenTextSplitter (precise token-count control), MarkdownHeaderTextSplitter (preserves document structure), and SemanticChunker (groups by meaning using embeddings). Choosing the right splitter and chunk size directly impacts retrieval quality.

# pip install langchain-text-splitters langchain-experimental tiktoken
from langchain_text_splitters import (
    RecursiveCharacterTextSplitter,
    CharacterTextSplitter,
    TokenTextSplitter,
    MarkdownHeaderTextSplitter,
    HTMLHeaderTextSplitter,
)
# SemanticChunker lives in langchain-experimental, not langchain-text-splitters
from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings

# Assumes pdf_docs were loaded in the previous block (Document Loaders)

# Recursive Character Splitter (RECOMMENDED for most use cases)
# Tries to split on paragraphs, then sentences, then words
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,        # Max characters per chunk
    chunk_overlap=200,      # Overlap between chunks (preserves context)
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""],  # Priority order
)
chunks = recursive_splitter.split_documents(pdf_docs)

# Token-based splitter (respects model token limits)
token_splitter = TokenTextSplitter(
    chunk_size=500,         # Max tokens per chunk
    chunk_overlap=50,       # Token overlap
    encoding_name="cl100k_base",  # GPT-4/3.5 tokenizer
)
token_chunks = token_splitter.split_documents(pdf_docs)

# Markdown-aware splitter (preserves document structure)
md_splitter = MarkdownHeaderTextSplitter(
    headers_to_split_on=[
        ("#", "Header 1"),
        ("##", "Header 2"),
        ("###", "Header 3"),
    ]
)
# Example: split a markdown string (replace with your own content)
markdown_content = "# Title\n## Section 1\nContent here.\n## Section 2\nMore content."
md_chunks = md_splitter.split_text(markdown_content)

# Semantic chunker (splits based on embedding similarity)
# Groups sentences that are semantically similar
semantic_splitter = SemanticChunker(
    OpenAIEmbeddings(),
    breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=95,
)
semantic_chunks = semantic_splitter.split_documents(pdf_docs)

# Compare chunk sizes
print(f"Recursive: {len(chunks)} chunks (avg {sum(len(c.page_content) for c in chunks)//len(chunks)} chars)")
print(f"Token: {len(token_chunks)} chunks")
print(f"Semantic: {len(semantic_chunks)} chunks")

Chunking Matters More Than You Think: The single biggest factor in RAG quality is chunking strategy. Chunks that are too small lose context. Chunks that are too large dilute relevance. A good starting point: 500-1000 characters with 10-20% overlap. Use semantic chunking for documents where topic boundaries do not align with paragraph breaks.
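To make the size/overlap trade-off concrete, here is a stripped-down sliding-window chunker: the character-level mechanic underneath chunk_size and chunk_overlap, without the separator-priority logic that RecursiveCharacterTextSplitter layers on top.

```python
def sliding_chunks(text: str, chunk_size: int = 1000, overlap: int = 200):
    """Naive character windows: each chunk repeats the last `overlap`
    characters of the previous one so context spans chunk boundaries."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap  # Window advances by size minus overlap
    return [text[i:i + chunk_size] for i in range(0, len(text), step)]

# A 2,500-character synthetic document
doc = "".join(chr(65 + i % 26) for i in range(2500))
chunks = sliding_chunks(doc, chunk_size=1000, overlap=200)
print([len(c) for c in chunks])  # [1000, 1000, 900, 100]
```

Note the trailing 100-character sliver: real splitters avoid these by preferring natural boundaries, which is one reason to use RecursiveCharacterTextSplitter instead of raw windows.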

4. Retriever Patterns

Retrievers are the bridge between user queries and stored documents. The choice of retrieval strategy dramatically impacts the quality and diversity of results. LangChain provides several retriever patterns, each optimized for different scenarios.

4.1 Similarity Search (Baseline)

The simplest retrieval strategy is pure similarity search — given a query, find the k most similar documents by vector distance. While straightforward, LangChain enhances this baseline with score thresholds (filtering out low-relevance results) and metadata filters. The retriever interface wraps any vector store into a standard Runnable that integrates directly into LCEL chains, making it easy to swap retrieval strategies without changing your pipeline.

from langchain_chroma import Chroma
from langchain_openai import ChatOpenAI, OpenAIEmbeddings

# Assumes you have a Chroma DB populated from the previous section
vectorstore = Chroma(
    persist_directory="./chroma_db",
    embedding_function=OpenAIEmbeddings()
)

# Basic similarity search retriever with a relevance cutoff
retriever = vectorstore.as_retriever(
    search_type="similarity_score_threshold",  # Required for score_threshold to apply
    search_kwargs={
        "k": 5,                    # Number of results
        "score_threshold": 0.7,    # Minimum similarity score
    }
)

# Use in a RAG chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.runnables import RunnablePassthrough
from langchain_core.output_parsers import StrOutputParser

def format_docs(docs):
    return "\n\n---\n\n".join(
        f"[Source: {d.metadata.get('source', 'unknown')}, "
        f"Page: {d.metadata.get('page', 'N/A')}]\n{d.page_content}"
        for d in docs
    )

rag_prompt = ChatPromptTemplate.from_template(
    "Answer the question based only on the following context. "
    "If the context does not contain the answer, say so.\n\n"
    "Context:\n{context}\n\n"
    "Question: {question}\n\n"
    "Answer:"
)

rag_chain = (
    {"context": retriever | format_docs, "question": RunnablePassthrough()}
    | rag_prompt
    | ChatOpenAI(model="gpt-4o", temperature=0)
    | StrOutputParser()
)

answer = rag_chain.invoke("What text splitting strategies improve RAG quality?")
print(answer)

4.2 Maximum Marginal Relevance (MMR)

MMR balances relevance with diversity — it penalizes documents that are too similar to already-selected results. This prevents the common problem of retrieving 5 nearly-identical chunks:

# MMR retriever - balances relevance and diversity
mmr_retriever = vectorstore.as_retriever(
    search_type="mmr",
    search_kwargs={
        "k": 5,                # Final number of results
        "fetch_k": 20,         # Candidates to consider (higher = more diverse)
        "lambda_mult": 0.7,    # 0=max diversity, 1=max relevance
    }
)

# Compare: similarity search might return 5 chunks all about the same subtopic
# MMR ensures each chunk covers a different aspect of the answer
sim_results = vectorstore.similarity_search("RAG best practices", k=5)
mmr_results = vectorstore.max_marginal_relevance_search(
    "RAG best practices", k=5, fetch_k=20
)
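What lambda_mult actually trades off can be seen in a plain numpy sketch of the greedy MMR selection loop (toy random vectors; a simplified illustration, not the library's exact implementation):

```python
import numpy as np

def mmr_select(query, candidates, k=3, lambda_mult=0.7):
    """Greedy MMR: each step picks the candidate maximizing
    lambda * sim(query, c) - (1 - lambda) * max_j sim(c, selected_j)."""
    sims = candidates @ query  # Relevance to query (all unit vectors)
    selected, remaining = [], list(range(len(candidates)))
    while remaining and len(selected) < k:
        best, best_score = None, -np.inf
        for i in remaining:
            # Penalty: similarity to the closest already-selected result
            redundancy = max(
                (candidates[i] @ candidates[j] for j in selected), default=0.0
            )
            score = lambda_mult * sims[i] - (1 - lambda_mult) * redundancy
            if score > best_score:
                best, best_score = i, score
        selected.append(best)
        remaining.remove(best)
    return selected

rng = np.random.default_rng(0)
cands = rng.normal(size=(20, 64))
cands /= np.linalg.norm(cands, axis=1, keepdims=True)
q = rng.normal(size=64)
q /= np.linalg.norm(q)

print(mmr_select(q, cands, k=3, lambda_mult=0.7))
```

At lambda_mult=1 this degenerates to pure similarity ranking; at 0 it maximizes diversity regardless of relevance.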

4.3 Multi-Query Retriever

Multi-query retrieval uses an LLM to generate multiple perspectives of the same question, then retrieves documents for each perspective and merges the results. This overcomes the limitation of single-query retrieval where the user's phrasing might not match the document's phrasing:

from langchain.retrievers import MultiQueryRetriever
from langchain_openai import ChatOpenAI

# The LLM generates alternative queries
multi_retriever = MultiQueryRetriever.from_llm(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 5}),
    llm=ChatOpenAI(model="gpt-4o-mini", temperature=0.3),
)

# User asks: "How do I make RAG faster?"
# LLM generates:
# 1. "What techniques reduce latency in RAG pipelines?"
# 2. "How to optimize retrieval speed in vector search?"
# 3. "Performance tuning strategies for RAG applications"
# Each query retrieves relevant docs, results are merged and deduplicated

results = multi_retriever.invoke("How do I make RAG faster?")
print(f"Retrieved {len(results)} unique documents from multiple query perspectives")

Retriever Selection Guide: Use similarity search as your baseline. Switch to MMR when you notice redundant results. Use multi-query when users phrase questions in unexpected ways. For production, combine them: multi-query for query expansion, then MMR for result diversification.

5. Advanced RAG Techniques

Basic RAG (embed-retrieve-generate) works well for simple use cases, but production systems often need more sophisticated approaches to handle complex queries, improve retrieval precision, and maintain context across document hierarchies.

5.1 HyDE (Hypothetical Document Embeddings)

HyDE asks the LLM to generate a hypothetical answer to the query, then embeds that hypothetical answer (instead of the query) to find similar real documents. The intuition is that a hypothetical answer is more semantically similar to the real documents than the original question:

from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_core.prompts import ChatPromptTemplate

# HyDE: Generate a hypothetical answer, then embed it for retrieval
hyde_prompt = ChatPromptTemplate.from_template(
    "Please write a detailed passage that would answer the "
    "following question. Write it as if it were a paragraph "
    "from an authoritative technical document.\n\n"
    "Question: {question}\n\nPassage:"
)

# Manual HyDE implementation for clarity
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnableLambda

model = ChatOpenAI(model="gpt-4o-mini", temperature=0.5)
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")

def hyde_retrieve(question: str, vectorstore, k: int = 5):
    """HyDE: Generate hypothetical doc, embed it, search for similar real docs."""
    # Step 1: Generate hypothetical answer
    hypothetical = (hyde_prompt | model | StrOutputParser()).invoke(
        {"question": question}
    )

    # Step 2: Embed the hypothetical answer (not the question!)
    hyde_vector = embeddings.embed_query(hypothetical)

    # Step 3: Search for real documents similar to the hypothetical
    results = vectorstore.similarity_search_by_vector(hyde_vector, k=k)

    return results, hypothetical

results, hypo = hyde_retrieve("What makes HNSW indexing fast?", vectorstore)
print(f"Hypothetical doc: {hypo[:200]}...")
print(f"Retrieved {len(results)} real documents")

5.2 RAG Fusion

RAG Fusion combines multi-query retrieval with Reciprocal Rank Fusion (RRF) scoring. It generates multiple query variants, retrieves results for each, and then fuses the ranked lists using RRF — a technique from information retrieval that produces better rankings than any individual query:

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser

def rag_fusion(question: str, vectorstore, k: int = 5, num_queries: int = 4):
    """RAG Fusion: Multi-query + Reciprocal Rank Fusion."""

    # Step 1: Generate multiple query variants
    query_gen_prompt = ChatPromptTemplate.from_template(
        "Generate {num_queries} different search queries that would "
        "help answer this question from different angles. "
        "Return one query per line, no numbering.\n\n"
        "Question: {question}"
    )
    query_chain = query_gen_prompt | ChatOpenAI(model="gpt-4o-mini") | StrOutputParser()
    query_text = query_chain.invoke({
        "question": question, "num_queries": num_queries
    })
    queries = [q.strip() for q in query_text.strip().split("\n") if q.strip()]

    # Step 2: Retrieve results for each query
    all_results = {}
    for query in queries:
        docs = vectorstore.similarity_search(query, k=k)
        for rank, doc in enumerate(docs):
            doc_id = doc.page_content[:100]  # Use content prefix as ID
            if doc_id not in all_results:
                all_results[doc_id] = {"doc": doc, "ranks": []}
            all_results[doc_id]["ranks"].append(rank + 1)

    # Step 3: Reciprocal Rank Fusion scoring
    K = 60  # RRF constant (standard value from the literature)
    scored = []
    for doc_id, data in all_results.items():
        rrf_score = sum(1.0 / (K + rank) for rank in data["ranks"])
        scored.append((data["doc"], rrf_score))

    # Step 4: Sort by RRF score (highest first)
    scored.sort(key=lambda x: x[1], reverse=True)

    return [(doc, score) for doc, score in scored[:k]]

# Usage
fused_results = rag_fusion("Best practices for production RAG systems", vectorstore)
for doc, score in fused_results:
    print(f"RRF Score: {score:.4f} - {doc.page_content[:80]}...")

5.3 Parent Document Retriever

The parent document retriever solves a fundamental tension in RAG: small chunks are better for precise retrieval, but large chunks provide better context for generation. It indexes small chunks for search but returns the larger parent document for the LLM:

from langchain.retrievers import ParentDocumentRetriever
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain.storage import InMemoryStore
from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

# Assumes pdf_docs were loaded via PyPDFLoader in the Document Loaders section

# Child splitter: small chunks for precise retrieval
child_splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,      # Small chunks for matching
    chunk_overlap=20,
)

# Parent splitter: larger chunks for context
parent_splitter = RecursiveCharacterTextSplitter(
    chunk_size=2000,     # Large chunks for the LLM
    chunk_overlap=200,
)

# Vector store for child chunks
vectorstore = Chroma(
    collection_name="child_chunks",
    embedding_function=OpenAIEmbeddings()
)

# Document store for parent chunks
docstore = InMemoryStore()

# Create the parent document retriever
parent_retriever = ParentDocumentRetriever(
    vectorstore=vectorstore,
    docstore=docstore,
    child_splitter=child_splitter,
    parent_splitter=parent_splitter,
)

# Index documents (automatically creates parent and child chunks)
parent_retriever.add_documents(pdf_docs)

# Search: matches on small child chunks, returns large parent chunks
results = parent_retriever.invoke("HNSW algorithm complexity")
# Each result is a large parent chunk (~2000 chars)
# that was matched via its small child chunk (~200 chars)
for doc in results:
    print(f"Parent chunk ({len(doc.page_content)} chars): {doc.page_content[:100]}...")
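Under the hood, the retriever's mechanics reduce to simple bookkeeping: each child chunk carries its parent's ID, the vector store matches on children, and the docstore resolves parents. A framework-free sketch of that flow — the fixed-width splitting and the substring "search" here are deliberately naive stand-ins for real chunking and vector similarity:

```python
import uuid

def index_parents(parent_texts, child_size=200):
    """Split each parent into small children, tagging every child
    with its parent's ID (the mapping a vector store would index)."""
    docstore, child_index = {}, []
    for text in parent_texts:
        parent_id = str(uuid.uuid4())
        docstore[parent_id] = text
        for i in range(0, len(text), child_size):
            child_index.append({"text": text[i:i + child_size],
                                "parent_id": parent_id})
    return docstore, child_index

def retrieve_parents(query, docstore, child_index):
    """Stand-in for vector search: match small children by substring,
    then return each matched child's full parent, deduplicated."""
    seen, results = set(), []
    for child in child_index:
        if query.lower() in child["text"].lower():
            pid = child["parent_id"]
            if pid not in seen:
                seen.add(pid)
                results.append(docstore[pid])
    return results

docstore, child_index = index_parents([
    "HNSW builds a layered graph for approximate search. " * 20,
    "IVF partitions vectors into clusters before searching. " * 20,
])
hits = retrieve_parents("layered graph", docstore, child_index)
# The match happens on a ~200-char child, but the caller
# receives the full ~1000-char parent document
```

The deduplication step matters: several children of the same parent often match one query, and without it the LLM would see the same parent context repeated.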
Advanced RAG Pipeline: For maximum quality, combine these techniques: use semantic chunking for intelligent splitting, parent document retrieval for context preservation, multi-query for query expansion, MMR for diversity, and re-ranking (covered in Part 6) for final result ordering. This "full stack RAG" approach typically outperforms naive RAG on retrieval quality, at the cost of extra latency and additional LLM calls.

Exercises & Self-Assessment

Hands-On Exercises

  1. Embedding Comparison: Embed the same 100 documents with text-embedding-3-small and bge-large-en-v1.5. Run 20 test queries against both. Compare retrieval quality (precision@5) and latency. Which model wins for your domain?
  2. Vector DB Benchmark: Load 10,000 documents into both Chroma and FAISS. Measure indexing time, search latency (p50, p95, p99), and memory usage. At what scale does the performance difference matter?
  3. Chunking Experiment: Take a 50-page PDF and split it with recursive (500 chars), token (200 tokens), and semantic chunking. Build a RAG pipeline with each and test with 10 questions. Which chunking strategy produces the best answers?
  4. Advanced RAG: Implement a full RAG pipeline that combines HyDE for query expansion, parent document retrieval for context, and MMR for diversity. Compare its performance against basic similarity search RAG.
  5. Production Pipeline: Build a complete document ingestion pipeline that handles PDFs, Markdown, and web pages. Include deduplication, metadata extraction, and automatic chunking with overlap. Persist everything to a vector database of your choice.
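For Exercise 1, precision@5 is straightforward to compute once you have hand-labeled which documents are relevant to each test query. A minimal helper — the document IDs and relevance set below are illustrative, not from any real corpus:

```python
def precision_at_k(retrieved_ids, relevant_ids, k=5):
    """Fraction of the top-k retrieved documents that are relevant."""
    top_k = retrieved_ids[:k]
    if not top_k:
        return 0.0
    return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)

# One query's top-5 results from one embedding model
retrieved = ["d3", "d7", "d1", "d9", "d4"]
relevant = {"d1", "d3", "d4"}   # hand-labeled relevance judgments

score = precision_at_k(retrieved, relevant, k=5)  # 3 of 5 relevant -> 0.6
```

Average this score over your 20 test queries for each embedding model; the model with the higher mean precision@5 wins on retrieval quality, and you can weigh that against the latency numbers you measured.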

Critical Thinking Questions

  1. Why does RAG typically outperform fine-tuning for incorporating domain knowledge? In what scenarios would fine-tuning be the better choice?
  2. Explain the trade-off between chunk size and retrieval quality. How does the parent document retriever resolve this tension?
  3. You are building a RAG system for a legal firm with 10 million documents. Which vector database would you choose and why? Consider cost, latency, filtering, and compliance requirements.
  4. HyDE generates a hypothetical document that might contain factual errors. Why does this still improve retrieval quality? What are the risks?
  5. Compare hybrid search (BM25 + vector) with pure vector search. When does keyword matching outperform semantic similarity, and vice versa?

RAG Pipeline Document Generator

Design and document a RAG pipeline architecture. Download as Word, Excel, PDF, or PowerPoint.


Conclusion & Next Steps

You now have a comprehensive understanding of Retrieval-Augmented Generation — the architecture that powers the majority of production AI applications. Here are the key takeaways from Part 5:

  • Embeddings transform text into semantic vectors — choose OpenAI for convenience, open-source models (BGE, MiniLM) for cost-free local inference
  • Vector databases each have their sweet spot — Chroma for prototyping, Qdrant/Weaviate for production self-hosted, Pinecone for fully managed, pgvector for PostgreSQL stacks
  • Document loading and text splitting are the unsung heroes of RAG quality — recursive character splitting is a strong default, and semantic chunking can push quality further when boundaries matter
  • Retriever patterns — similarity for baseline, MMR for diversity, multi-query for robustness — can be combined for optimal results
  • Advanced RAG techniques (HyDE, RAG fusion, parent document retrieval) dramatically improve retrieval quality over naive approaches
  • The best RAG pipeline is one that matches your specific data characteristics, query patterns, latency requirements, and budget

Next in the Series

In Part 6: Memory & Context Engineering, we explore how to give AI applications persistent memory — buffer, summary, window, vector, and entity memory patterns — plus context engineering techniques including chunking strategies, re-ranking with Cohere and ColBERT, prompt compression, and context window management.
