
AI Application Development Mastery Part 15: Evaluation & LLMOps

April 1, 2026 Wasil Zafar 43 min read

Learn how to systematically evaluate LLM outputs for correctness, faithfulness, and relevance. Master RAGAS for RAG evaluation, LLM-as-judge patterns, observability with LangSmith and Langfuse, experiment tracking, CI/CD pipelines for LLM applications, and production cost tracking strategies.

Table of Contents

  1. Evaluation Foundations
  2. Evaluation Methods
  3. RAG Evaluation with RAGAS
  4. Agent Evaluation
  5. Observability & Tracing
  6. Experiment Tracking
  7. CI/CD for LLM Apps
  8. Cost Tracking & Optimization
  9. Exercises & Self-Assessment
  10. LLMOps Config Generator
  11. Conclusion & Next Steps

Introduction: You Cannot Improve What You Cannot Measure

Series Overview: This is Part 15 of our 20-part AI Application Development Mastery series. We now tackle one of the most critical and often overlooked aspects of AI app development — how to systematically evaluate LLM outputs, build observability into every layer, and operationalize your AI systems with LLMOps best practices.

AI Application Development Mastery

Your 20-step learning path • Currently on Step 15
  1. Foundations & Evolution of AI Apps: Pre-LLM era, transformers, LLM revolution
  2. LLM Fundamentals for Developers: Tokens, context windows, sampling, API patterns
  3. Prompt Engineering Mastery: Zero/few-shot, CoT, ReAct, structured outputs
  4. LangChain Core Concepts: Chains, prompts, LLMs, tools, LCEL
  5. Retrieval-Augmented Generation (RAG): Embeddings, vector DBs, retrievers, RAG pipelines
  6. Memory & Context Engineering: Buffer/summary/vector memory, chunking, re-ranking
  7. Agents — Core of Modern AI Apps: ReAct, tool-calling, planner-executor agents
  8. LangGraph — Stateful Agent Workflows: Nodes, edges, state, graph execution, cycles
  9. Deep Agents & Autonomous Systems: Multi-step reasoning, self-reflection, planning
  10. Multi-Agent Systems: Supervisor, swarm, debate, role-based collaboration
  11. AI Application Design Patterns: RAG, chat+memory, workflow automation, agent loops
  12. Ecosystem & Frameworks: LlamaIndex, Haystack, HuggingFace, vLLM
  13. MCP Foundations & Architecture: Protocol design, Host/Client/Server, primitives, security
  14. MCP in Production: Building servers, integrations, scaling, agent systems
  15. Evaluation & LLMOps: Prompt eval, tracing, LangSmith, experiment tracking (You Are Here)
  16. Production AI Systems: APIs, queues, caching, streaming, scaling
  17. Safety, Guardrails & Reliability: Input filtering, hallucination mitigation, prompt injection
  18. Advanced Topics: Fine-tuning, tool learning, hybrid LLM+symbolic
  19. Building Real AI Applications: Chatbot, document QA, coding assistant, full-stack
  20. Future of AI Applications: Autonomous agents, self-improving, multi-modal, AI OS

Building an AI application that works in a demo is easy. Building one that works reliably in production is an entirely different challenge. The difference between the two comes down to one word: evaluation. Without rigorous, systematic evaluation, you are flying blind — shipping prompts you hope work, deploying RAG pipelines you assume retrieve the right documents, and running agents you pray will not go off the rails.

LLMOps (Large Language Model Operations) is the discipline that brings the rigor of traditional MLOps — experiment tracking, CI/CD, monitoring, cost management — to the unique challenges of LLM-powered applications. Unlike classical ML where you optimize a single metric (accuracy, F1), LLM evaluation is multidimensional: a response can be factually correct but irrelevant, or relevant but unfaithful to the source material, or faithful but poorly written.

Key Insight: Evaluation is not a one-time activity — it is a continuous process baked into every stage of your development lifecycle. The best AI teams evaluate during development (offline eval), before deployment (CI/CD eval), and after deployment (online monitoring). This part teaches you how to build all three.

1. Evaluation Foundations

Evaluating LLM applications is fundamentally different from evaluating traditional software. Outputs are non-deterministic, quality is subjective, and there’s no single "correct" answer for most tasks. This section establishes the foundational concepts that underpin all LLM evaluation — why it matters, what makes it hard, and the core metrics and methodologies that form the basis of a rigorous evaluation strategy.

1.1 Why LLM Evaluation Matters

Traditional software has deterministic outputs — given the same input, you get the same output. LLMs are stochastic — the same prompt can produce different responses across runs. This fundamental difference means that traditional unit tests are necessary but insufficient. You need a new evaluation paradigm.
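One practical consequence: production test suites assert properties of a response rather than exact strings, so any acceptable phrasing passes. A minimal sketch of the idea (the function and property names are illustrative, not from any testing framework):

```python
# Property-based checks for a stochastic LLM output.
# Instead of `assert response == expected`, assert invariants that any
# acceptable response must satisfy. All names here are illustrative.

def check_response_properties(response: str, must_contain: list[str],
                              max_words: int = 100) -> dict:
    """Return pass/fail for each property of a single LLM response."""
    words = response.split()
    return {
        "mentions_required_facts": all(
            fact.lower() in response.lower() for fact in must_contain
        ),
        "within_length_budget": len(words) <= max_words,
        "non_empty": len(words) > 0,
    }

# Two different phrasings of the same correct answer both pass:
r1 = "You get 20 days of annual leave."
r2 = "Employees are entitled to 20 days of annual leave each year."
for r in (r1, r2):
    results = check_response_properties(r, must_contain=["20 days"])
    assert all(results.values()), results
```

An exact-match assertion would fail on one of these two responses; the property checks pass both while still catching a response that omits the required fact.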

Case Study

The Bereavement Fare That Never Existed

In 2024, a Canadian tribunal held an airline liable after its chatbot hallucinated a bereavement fare policy that did not exist. A customer had relied on the chatbot's answer when booking a flight, and the tribunal ruled that the chatbot's output was binding — the airline could not disclaim responsibility for its own AI system. This case demonstrates why evaluation is not optional: untested LLM outputs create legal, financial, and reputational risk.

Tags: Hallucination · Legal Liability · Production Failure · No Guardrails

1.2 Evaluation Taxonomy

LLM evaluation spans multiple dimensions and methods. Here is a comprehensive taxonomy:

| Dimension | What It Measures | When to Use | Example |
| --- | --- | --- | --- |
| Correctness | Is the answer factually accurate? | QA, knowledge retrieval, factual tasks | "What is the capital of France?" should return "Paris" |
| Faithfulness | Is the answer grounded in the provided context? | RAG systems, summarization | Answer only uses facts from retrieved documents |
| Relevance | Does the answer address the user's question? | All conversational AI | When asked about Python, the answer is about Python, not Java |
| Coherence | Is the answer well-structured and readable? | Long-form generation, reports | Logical flow, no contradictions within the response |
| Harmlessness | Is the answer safe and non-toxic? | All user-facing applications | No harmful, biased, or offensive content |
| Helpfulness | Does the answer actually help the user? | Assistants, customer support | Actionable advice vs. generic non-answers |

1.3 Correctness, Faithfulness & Relevance — The Core Triad

These three dimensions form the foundation of LLM evaluation. Understanding their distinctions is critical:

# The Core Evaluation Triad — Illustrated with Examples

# SCENARIO: RAG system answering questions about company policies
# Retrieved Context: "Employees get 20 days of annual leave.
#                     Sick leave is 10 days per year."
# Question: "How many vacation days do I get?"

# --- CORRECTNESS ---
# Is the factual content accurate?
correct_answer = "You get 20 days of annual leave."  # Correct
incorrect_answer = "You get 30 days of annual leave."  # Incorrect (wrong number)

# --- FAITHFULNESS ---
# Is the answer grounded in the provided context?
faithful_answer = "You get 20 days of annual leave per year."  # Faithful
unfaithful_answer = "You get 20 days of annual leave plus 5 personal days."  # Unfaithful (5 personal days not in context)

# --- RELEVANCE ---
# Does the answer address the user's question?
relevant_answer = "You get 20 days of annual leave."  # Relevant (about vacation)
irrelevant_answer = "Sick leave is 10 days per year."  # Irrelevant (not what was asked)

# KEY INSIGHT: An answer can be:
# - Correct but irrelevant (factually true but answers wrong question)
# - Faithful but incorrect (only uses context, but context is wrong)
# - Relevant but unfaithful (addresses the question but hallucinates details)

# The best answers score high on ALL THREE dimensions
Key Insight: In RAG systems, faithfulness is often more important than correctness. If your retrieved documents contain incorrect information, a faithful answer will surface that incorrectness transparently — allowing you to fix the source data. An unfaithful answer that happens to be correct masks the underlying data quality issue and creates a false sense of reliability.

2. Evaluation Methods

LLM evaluation methods fall into three categories: human evaluation (gold standard but expensive and slow), automated metrics (fast and reproducible but limited in capturing nuance), and LLM-as-judge (using a stronger model to evaluate a weaker one). Each method has distinct strengths, and production systems typically combine all three in a layered evaluation strategy that balances cost, speed, and accuracy.

2.1 Human Evaluation

Human evaluation remains the gold standard for assessing LLM outputs, especially for subjective qualities like helpfulness and tone. However, it is expensive, slow, and does not scale.

| Method | Description | Strengths | Weaknesses |
| --- | --- | --- | --- |
| Likert Scale | Rate outputs 1-5 on dimensions (quality, relevance, etc.) | Easy to aggregate, statistical analysis | Subjective, inter-rater disagreement |
| Pairwise Comparison | Show two outputs, pick the better one | Simpler decision, higher agreement | O(n^2) comparisons, no absolute scores |
| Binary Pass/Fail | Is this output acceptable? Yes/No | Fast, unambiguous, high inter-rater reliability | No nuance, misses quality gradations |
| Rubric-Based | Detailed scoring rubric with explicit criteria | Consistent, reduces subjectivity | Time to create rubric, rigidity |

# Human evaluation framework with structured rubrics
from dataclasses import dataclass, field
from typing import List, Optional
from enum import Enum

class EvalDimension(Enum):
    CORRECTNESS = "correctness"
    FAITHFULNESS = "faithfulness"
    RELEVANCE = "relevance"
    COHERENCE = "coherence"
    HELPFULNESS = "helpfulness"

@dataclass
class HumanEvalRubric:
    """Structured rubric for human evaluation of LLM outputs."""
    dimension: EvalDimension
    scale: dict  # score -> description

    @staticmethod
    def correctness_rubric():
        return HumanEvalRubric(
            dimension=EvalDimension.CORRECTNESS,
            scale={
                1: "Completely incorrect — factual errors throughout",
                2: "Mostly incorrect — major factual errors",
                3: "Partially correct — mix of correct and incorrect facts",
                4: "Mostly correct — minor inaccuracies only",
                5: "Fully correct — all facts verified and accurate"
            }
        )

    @staticmethod
    def faithfulness_rubric():
        return HumanEvalRubric(
            dimension=EvalDimension.FAITHFULNESS,
            scale={
                1: "Completely unfaithful — fabricates information not in context",
                2: "Mostly unfaithful — significant hallucinated content",
                3: "Mixed — some grounded, some hallucinated",
                4: "Mostly faithful — minor unsupported claims",
                5: "Fully faithful — every claim traceable to source"
            }
        )

@dataclass
class HumanEvalResult:
    """Single human evaluation result."""
    evaluator_id: str
    sample_id: str
    question: str
    context: Optional[str]
    llm_response: str
    scores: dict  # dimension -> score
    reasoning: str
    timestamp: str

@dataclass
class HumanEvalSession:
    """Manages a complete human evaluation session."""
    eval_name: str
    rubrics: List[HumanEvalRubric]
    results: List[HumanEvalResult] = field(default_factory=list)

    def add_result(self, result: HumanEvalResult):
        self.results.append(result)

    def inter_rater_agreement(self, dimension: str) -> float:
        """Percent exact agreement between the first two raters per sample.
        (A simple proxy; use Cohen's kappa for chance-corrected agreement.)"""
        # Group results by sample_id
        from collections import defaultdict
        sample_scores = defaultdict(list)
        for r in self.results:
            if dimension in r.scores:
                sample_scores[r.sample_id].append(r.scores[dimension])

        # Need at least 2 ratings per sample
        paired = {k: v for k, v in sample_scores.items() if len(v) >= 2}
        if not paired:
            return 0.0

        agreements = sum(1 for scores in paired.values() if scores[0] == scores[1])
        return agreements / len(paired)

    def dimension_summary(self, dimension: str) -> dict:
        """Get summary statistics for a dimension."""
        scores = [r.scores[dimension] for r in self.results if dimension in r.scores]
        if not scores:
            return {"count": 0, "mean": 0, "min": 0, "max": 0}
        return {
            "count": len(scores),
            "mean": round(sum(scores) / len(scores), 2),
            "min": min(scores),
            "max": max(scores),
            "pass_rate": round(sum(1 for s in scores if s >= 4) / len(scores) * 100, 1)
        }
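The pairwise-comparison method in the table above needs an aggregation step to turn individual A-vs-B judgments into a ranking. A minimal win-rate aggregator is sketched below (the judgment format is illustrative; for more statistical rigor, fit a Bradley-Terry or Elo model instead):

```python
from collections import defaultdict

def aggregate_pairwise(judgments: list[tuple[str, str, str]]) -> dict[str, float]:
    """Aggregate pairwise judgments into per-model win rates.

    Each judgment is (model_a, model_b, winner), where winner is
    model_a, model_b, or "tie". Ties count as half a win for each side.
    """
    wins: dict[str, float] = defaultdict(float)
    games: dict[str, int] = defaultdict(int)
    for a, b, winner in judgments:
        games[a] += 1
        games[b] += 1
        if winner == "tie":
            wins[a] += 0.5
            wins[b] += 0.5
        else:
            wins[winner] += 1.0
    return {m: round(wins[m] / games[m], 3) for m in games}

judgments = [
    ("prompt-v1", "prompt-v2", "prompt-v2"),
    ("prompt-v1", "prompt-v2", "prompt-v2"),
    ("prompt-v1", "prompt-v2", "tie"),
]
print(aggregate_pairwise(judgments))
# {'prompt-v1': 0.167, 'prompt-v2': 0.833}
```

Win rates give a relative ranking without absolute scores — exactly the trade-off noted in the table.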

2.2 LLM-as-Judge

The LLM-as-Judge pattern uses a strong LLM (typically a GPT-4-class model or Claude) to evaluate the outputs of another LLM. It has become the dominant evaluation method in production because it scales far beyond human review, applies criteria consistently, and correlates well with human judgments when the judge prompt is carefully designed.

# LLM-as-Judge — Production-grade implementation
# pip install langchain-openai pydantic

import os
import json
from typing import Literal
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate
from pydantic import BaseModel, Field

# Requires OPENAI_API_KEY environment variable
# export OPENAI_API_KEY="sk-..."

# --- Structured Output for Judge ---
class CorrectnessScore(BaseModel):
    """Structured evaluation of correctness."""
    score: Literal[1, 2, 3, 4, 5] = Field(
        description="1=completely wrong, 5=fully correct"
    )
    reasoning: str = Field(
        description="Step-by-step reasoning for the score"
    )
    errors_found: list[str] = Field(
        default_factory=list,
        description="List of specific errors identified"
    )

class FaithfulnessScore(BaseModel):
    """Structured evaluation of faithfulness to source context."""
    score: Literal[1, 2, 3, 4, 5] = Field(
        description="1=completely hallucinated, 5=fully grounded"
    )
    reasoning: str = Field(
        description="Step-by-step reasoning for the score"
    )
    hallucinated_claims: list[str] = Field(
        default_factory=list,
        description="Claims not supported by the provided context"
    )
    grounded_claims: list[str] = Field(
        default_factory=list,
        description="Claims that are supported by the context"
    )

class RelevanceScore(BaseModel):
    """Structured evaluation of relevance."""
    score: Literal[1, 2, 3, 4, 5] = Field(
        description="1=completely irrelevant, 5=perfectly relevant"
    )
    reasoning: str = Field(
        description="Step-by-step reasoning for the score"
    )

# --- Judge Prompts ---
CORRECTNESS_JUDGE_PROMPT = ChatPromptTemplate.from_messages([
    ("system", """You are an expert evaluator. Assess the CORRECTNESS of
the AI response to the given question.

Correctness measures whether the response contains factually accurate
information. Use your knowledge to verify claims.

Scoring rubric:
1 = Completely incorrect — major factual errors throughout
2 = Mostly incorrect — significant factual errors
3 = Partially correct — mix of correct and incorrect facts
4 = Mostly correct — only minor inaccuracies
5 = Fully correct — all verifiable facts are accurate

Be strict but fair. Identify specific errors."""),
    ("human", """Question: {question}

AI Response: {response}

Reference Answer (if available): {reference}

Evaluate the correctness of this response.""")
])

FAITHFULNESS_JUDGE_PROMPT = ChatPromptTemplate.from_messages([
    ("system", """You are an expert evaluator. Assess the FAITHFULNESS of
the AI response to the provided context.

Faithfulness measures whether EVERY claim in the response can be
traced back to the provided context. A faithful response does not
add information beyond what the context supports.

Scoring rubric:
1 = Completely unfaithful — fabricates information not in context
2 = Mostly unfaithful — significant hallucinated content
3 = Mixed — some claims grounded, some hallucinated
4 = Mostly faithful — minor unsupported details
5 = Fully faithful — every claim traceable to the context

List each claim and whether it is grounded or hallucinated."""),
    ("human", """Context: {context}

Question: {question}

AI Response: {response}

Evaluate the faithfulness of this response to the provided context.""")
])

# --- Judge Implementation ---
class LLMJudge:
    """Production LLM-as-Judge evaluator."""

    def __init__(self, model: str = "gpt-4o", temperature: float = 0.0):
        self.llm = ChatOpenAI(model=model, temperature=temperature)

    def evaluate_correctness(
        self, question: str, response: str, reference: str = "N/A"
    ) -> CorrectnessScore:
        """Evaluate correctness of a response."""
        chain = CORRECTNESS_JUDGE_PROMPT | self.llm.with_structured_output(
            CorrectnessScore
        )
        return chain.invoke({
            "question": question,
            "response": response,
            "reference": reference
        })

    def evaluate_faithfulness(
        self, context: str, question: str, response: str
    ) -> FaithfulnessScore:
        """Evaluate faithfulness of a response to its source context."""
        chain = FAITHFULNESS_JUDGE_PROMPT | self.llm.with_structured_output(
            FaithfulnessScore
        )
        return chain.invoke({
            "context": context,
            "question": question,
            "response": response
        })

    def batch_evaluate(
        self, samples: list[dict], eval_type: str = "correctness"
    ) -> list[dict]:
        """Evaluate a batch of samples."""
        results = []
        for sample in samples:
            if eval_type == "correctness":
                score = self.evaluate_correctness(
                    sample["question"], sample["response"],
                    sample.get("reference", "N/A")
                )
            elif eval_type == "faithfulness":
                score = self.evaluate_faithfulness(
                    sample["context"], sample["question"],
                    sample["response"]
                )
            results.append({
                **sample,
                "eval_type": eval_type,
                "score": score.score,
                "reasoning": score.reasoning
            })
        return results

# Usage
judge = LLMJudge(model="gpt-4o")

result = judge.evaluate_faithfulness(
    context="Our company offers 20 days of PTO and 10 sick days annually.",
    question="How much time off do employees get?",
    response="Employees receive 20 days of PTO, 10 sick days, and 5 personal days."
)
print(f"Faithfulness: {result.score}/5")
print(f"Hallucinated: {result.hallucinated_claims}")
# Example output (judge verdicts vary between runs):
# Faithfulness: 3/5
# Hallucinated: ['5 personal days — not mentioned in context']
Common Mistake: Using the same model to both generate and judge outputs creates a self-evaluation bias. The judge model tends to rate its own outputs more favorably. Always use a different model as the judge (e.g., use GPT-4o to judge Claude outputs, or vice versa), or use an even more capable model (e.g., GPT-4o judging GPT-4o-mini outputs).
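A related judge bias is position bias: pairwise judges tend to favor whichever answer appears first in the prompt. A standard mitigation is to judge each pair twice with the order swapped and accept only verdicts that agree. A sketch, with the judge passed in as a callable so any model can be plugged in (function names and the "first"/"second" protocol are illustrative):

```python
from typing import Callable

def debiased_pairwise_judge(
    judge: Callable[[str, str, str], str],
    question: str,
    answer_a: str,
    answer_b: str,
) -> str:
    """Judge a pair twice with the order swapped to cancel position bias.

    `judge(question, first, second)` must return "first" or "second".
    Returns "A", "B", or "inconsistent" when the two passes disagree.
    """
    pass1 = judge(question, answer_a, answer_b)   # A shown first
    pass2 = judge(question, answer_b, answer_a)   # B shown first
    if pass1 == "first" and pass2 == "second":
        return "A"   # A won in both orderings
    if pass1 == "second" and pass2 == "first":
        return "B"   # B won in both orderings
    return "inconsistent"  # verdict flipped with position -> discard or tie

# A stub judge that prefers the longer answer (order-independent):
stub = lambda q, first, second: "first" if len(first) > len(second) else "second"
print(debiased_pairwise_judge(stub, "Q?", "short", "a much longer answer"))
# B
```

Inconsistent verdicts are a useful signal in themselves: a high inconsistency rate means your judge prompt is sensitive to presentation rather than content.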

2.3 Automated Metrics

Automated metrics provide fast, deterministic, and reproducible evaluation scores. While less nuanced than human or LLM-as-judge evaluation, they are essential for CI/CD pipelines and regression testing.

# Automated evaluation metrics for LLM outputs
# pip install openai numpy  (only needed for semantic_similarity)

import os
import re
import math
from collections import Counter

# semantic_similarity() requires OPENAI_API_KEY environment variable
# export OPENAI_API_KEY="sk-..."

class AutomatedMetrics:
    """Collection of automated evaluation metrics."""

    @staticmethod
    def _tokenize(text: str) -> list:
        """Lowercase and strip punctuation, so 'Paris,' and 'Paris'
        count as the same token."""
        return re.findall(r"[a-z0-9']+", text.lower())

    @staticmethod
    def exact_match(prediction: str, reference: str) -> float:
        """Exact string match after whitespace/case normalization."""
        def normalize(text):
            return re.sub(r'\s+', ' ', text.lower().strip())
        return 1.0 if normalize(prediction) == normalize(reference) else 0.0

    @staticmethod
    def f1_score(prediction: str, reference: str) -> float:
        """Token-level F1 score between prediction and reference."""
        pred_tokens = AutomatedMetrics._tokenize(prediction)
        ref_tokens = AutomatedMetrics._tokenize(reference)

        common = Counter(pred_tokens) & Counter(ref_tokens)
        num_common = sum(common.values())

        if num_common == 0:
            return 0.0

        precision = num_common / len(pred_tokens)
        recall = num_common / len(ref_tokens)
        return 2 * (precision * recall) / (precision + recall)

    @staticmethod
    def bleu_score(prediction: str, reference: str, n: int = 4) -> float:
        """Simplified BLEU (unigram to n-gram precision with a linear
        brevity penalty; reference BLEU clips counts and uses an
        exponential penalty)."""
        pred_tokens = AutomatedMetrics._tokenize(prediction)
        ref_tokens = AutomatedMetrics._tokenize(reference)

        scores = []
        for i in range(1, n + 1):
            pred_ngrams = [tuple(pred_tokens[j:j+i]) for j in range(len(pred_tokens)-i+1)]
            ref_ngrams = [tuple(ref_tokens[j:j+i]) for j in range(len(ref_tokens)-i+1)]

            if not pred_ngrams:
                scores.append(0)
                continue

            matches = sum(1 for ng in pred_ngrams if ng in ref_ngrams)
            scores.append(matches / len(pred_ngrams))

        if 0 in scores:
            return 0.0

        log_avg = sum(math.log(s) for s in scores) / len(scores)
        brevity = min(1.0, len(pred_tokens) / len(ref_tokens))
        return brevity * math.exp(log_avg)

    @staticmethod
    def rouge_l(prediction: str, reference: str) -> float:
        """ROUGE-L (Longest Common Subsequence) F-score."""
        pred_tokens = AutomatedMetrics._tokenize(prediction)
        ref_tokens = AutomatedMetrics._tokenize(reference)

        m, n = len(pred_tokens), len(ref_tokens)
        dp = [[0] * (n + 1) for _ in range(m + 1)]

        for i in range(1, m + 1):
            for j in range(1, n + 1):
                if pred_tokens[i-1] == ref_tokens[j-1]:
                    dp[i][j] = dp[i-1][j-1] + 1
                else:
                    dp[i][j] = max(dp[i-1][j], dp[i][j-1])

        lcs = dp[m][n]
        if lcs == 0:
            return 0.0

        precision = lcs / m
        recall = lcs / n
        return 2 * precision * recall / (precision + recall)

    @staticmethod
    def semantic_similarity(
        prediction: str, reference: str, model: str = "text-embedding-3-small"
    ) -> float:
        """Cosine similarity between embeddings (requires OpenAI)."""
        from openai import OpenAI
        import numpy as np

        client = OpenAI()
        resp = client.embeddings.create(
            input=[prediction, reference], model=model
        )

        emb1 = np.array(resp.data[0].embedding)
        emb2 = np.array(resp.data[1].embedding)

        return float(np.dot(emb1, emb2) / (np.linalg.norm(emb1) * np.linalg.norm(emb2)))

# Usage comparison
metrics = AutomatedMetrics()
pred = "The capital of France is Paris, a beautiful city on the Seine."
ref = "Paris is the capital of France."

print(f"Exact Match: {metrics.exact_match(pred, ref)}")      # 0.0
print(f"F1 Score:    {metrics.f1_score(pred, ref):.3f}")     # 0.667
print(f"ROUGE-L:     {metrics.rouge_l(pred, ref):.3f}")      # 0.444
print(f"BLEU:        {metrics.bleu_score(pred, ref):.3f}")   # 0.244
| Metric | Best For | Limitations | Cost |
| --- | --- | --- | --- |
| Exact Match | Factoid QA, classification | Too strict for open-ended answers | Free |
| F1 Score | Extractive QA, named entities | Ignores word order and meaning | Free |
| BLEU/ROUGE | Summarization, translation | N-gram overlap misses semantics | Free |
| Semantic Similarity | Paraphrase detection, open-ended | Requires embedding model, API cost | Low ($) |
| LLM-as-Judge | Any evaluation dimension | LLM cost, potential bias | Medium ($$) |
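The cost column above suggests a layered strategy: run the free metrics on everything and spend judge calls only on samples the cheap metrics cannot settle. A sketch with both scorers injected as callables (the thresholds and field names are illustrative choices):

```python
from typing import Callable

def layered_eval(
    samples: list[dict],
    cheap_metric: Callable[[str, str], float],
    llm_judge: Callable[[dict], float],
    pass_threshold: float = 0.8,
    fail_threshold: float = 0.2,
) -> list[dict]:
    """Score every sample with a free metric; escalate ambiguous ones.

    Samples scoring above pass_threshold or below fail_threshold are
    settled for free; only the ambiguous middle band pays for a judge.
    """
    results = []
    for s in samples:
        score = cheap_metric(s["response"], s["reference"])
        if score >= pass_threshold:
            verdict, method = "pass", "cheap"
        elif score <= fail_threshold:
            verdict, method = "fail", "cheap"
        else:
            verdict = "pass" if llm_judge(s) >= 0.5 else "fail"
            method = "judge"
        results.append({**s, "score": score, "verdict": verdict, "method": method})
    return results
```

With exact match or token overlap as the cheap metric, typically only a minority of samples land in the ambiguous band — and that is where the judge budget goes.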

3. RAG Evaluation with RAGAS

RAGAS (Retrieval Augmented Generation Assessment) is the leading framework for evaluating RAG pipelines. It provides a suite of metrics that independently assess the retrieval and generation components of your pipeline.

3.1 RAGAS Metrics Deep Dive

| RAGAS Metric | What It Measures | Evaluates | Range |
| --- | --- | --- | --- |
| Faithfulness | Are generated claims supported by retrieved context? | Generator | 0.0 - 1.0 |
| Answer Relevance | Does the answer address the original question? | Generator | 0.0 - 1.0 |
| Context Precision | Are relevant documents ranked higher in retrieved results? | Retriever | 0.0 - 1.0 |
| Context Recall | Do retrieved documents cover all needed information? | Retriever | 0.0 - 1.0 |
| Context Relevancy | Are retrieved documents relevant to the question? | Retriever | 0.0 - 1.0 |
| Answer Correctness | Is the answer factually correct vs ground truth? | End-to-end | 0.0 - 1.0 |
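To build intuition for the ranking-sensitive metric, here is one common formulation of context precision (close in spirit to, though not identical to, the RAGAS implementation): average precision@k over the ranks where a relevant chunk appears.

```python
def context_precision(relevance: list[int]) -> float:
    """Rank-weighted precision over a retrieved chunk list.

    `relevance[k]` is 1 if the chunk at rank k (0-based) is relevant to
    the question, else 0. Averaging precision@k at each relevant rank
    rewards rankings that put relevant chunks first.
    """
    if not any(relevance):
        return 0.0
    total, hits = 0.0, 0
    for k, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / k        # precision@k at this relevant rank
    return total / sum(relevance)

# Same two relevant chunks, different rankings:
print(context_precision([1, 1, 0, 0]))  # 1.0  -> relevant chunks first
print(context_precision([0, 0, 1, 1]))  # ~0.42 -> relevant chunks buried
```

The two calls retrieve identical chunk sets, yet the score differs sharply — which is exactly why context precision catches ranking problems that a set-based recall metric misses.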

3.2 RAGAS Implementation

Section 3.1 defined what each RAGAS metric measures; this section shows how to run them. Each evaluation sample needs four fields: the question, the generated answer, the list of retrieved contexts, and a ground-truth reference. The implementation below prepares such a dataset and runs RAGAS scoring against it.

# RAGAS — Evaluating a RAG pipeline end-to-end
# pip install ragas langchain-openai datasets

import os
from ragas import evaluate

# Requires OPENAI_API_KEY environment variable (RAGAS uses OpenAI for evaluation)
# export OPENAI_API_KEY="sk-..."
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
    answer_correctness
)
from datasets import Dataset

# Prepare your evaluation dataset
# Each sample needs: question, answer, contexts, ground_truth
eval_data = {
    "question": [
        "What is the company's remote work policy?",
        "How do I request time off?",
        "What are the health insurance options?",
    ],
    "answer": [
        "Employees can work remotely up to 3 days per week with manager approval.",
        "Submit a time-off request through the HR portal at least 2 weeks in advance.",
        "We offer three health plans: Basic HMO, Standard PPO, and Premium PPO.",
    ],
    "contexts": [
        [
            "Remote Work Policy: Employees may work from home up to 3 days per week. "
            "Manager approval is required. Fully remote arrangements require VP approval.",
            "Office hours are 9 AM to 5 PM. Core collaboration hours are 10 AM to 3 PM."
        ],
        [
            "Time-Off Requests: All PTO requests must be submitted via the HR portal. "
            "Requests should be made at least 2 weeks before the requested dates.",
            "Emergency leave can be requested retroactively within 3 business days."
        ],
        [
            "Health Insurance: The company offers Basic HMO ($0/month), Standard PPO "
            "($50/month), and Premium PPO ($150/month) plans.",
            "Dental and vision coverage are included in Standard and Premium plans."
        ],
    ],
    "ground_truth": [
        "Employees can work remotely up to 3 days per week with manager approval. Fully remote requires VP approval.",
        "Time-off requests are submitted through the HR portal at least 2 weeks in advance. Emergency leave can be requested retroactively within 3 days.",
        "Three health plans are available: Basic HMO ($0/month), Standard PPO ($50/month), and Premium PPO ($150/month).",
    ]
}

dataset = Dataset.from_dict(eval_data)

# Run RAGAS evaluation
results = evaluate(
    dataset=dataset,
    metrics=[
        faithfulness,
        answer_relevancy,
        context_precision,
        context_recall,
        answer_correctness
    ]
)

# Print overall scores
print(results)
# Example scores (actual values vary by model and run):
# {'faithfulness': 0.92, 'answer_relevancy': 0.95,
#  'context_precision': 0.88, 'context_recall': 0.85,
#  'answer_correctness': 0.90}

# Convert to pandas for detailed analysis
df = results.to_pandas()
print(df[['question', 'faithfulness', 'answer_relevancy', 'context_precision']])
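Once each run produces a score dict like this, comparing runs becomes mechanical. A small helper for flagging regressions between a candidate run and a stored baseline (the tolerance value is an arbitrary choice you should tune):

```python
def compare_runs(baseline: dict[str, float], candidate: dict[str, float],
                 tolerance: float = 0.02) -> dict:
    """Flag metrics where the candidate dropped more than `tolerance`
    below the baseline. Metrics missing from either run are skipped."""
    regressions = {}
    for metric in baseline.keys() & candidate.keys():
        delta = candidate[metric] - baseline[metric]
        if delta < -tolerance:
            regressions[metric] = {
                "baseline": baseline[metric],
                "candidate": candidate[metric],
                "delta": round(delta, 4),
            }
    return {"passed": not regressions, "regressions": regressions}

baseline = {"faithfulness": 0.92, "answer_relevancy": 0.95}
candidate = {"faithfulness": 0.85, "answer_relevancy": 0.96}
report = compare_runs(baseline, candidate)
print(report["passed"])         # False
print(report["regressions"])    # details for 'faithfulness'
```

This is the primitive behind the CI/CD gating covered later: block the deploy when `passed` is false.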

3.3 End-to-End RAG Eval Pipeline

For production RAG systems, you need a reusable evaluation pipeline that can run against any configuration and produce comparable results. The pipeline below wraps RAGAS with configurable thresholds, automated pass/fail decisions, and structured reports. This makes RAG evaluation a repeatable, automatable step in your deployment process rather than an ad hoc manual check.

# Production RAG evaluation pipeline
# Integrates RAGAS with custom metrics and reporting
# pip install ragas datasets

import os
import json
from dataclasses import dataclass
from typing import Optional
from datetime import datetime
from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness, answer_relevancy,
    context_precision, context_recall, answer_correctness
)

# Requires OPENAI_API_KEY environment variable
# export OPENAI_API_KEY="sk-..."

@dataclass
class RAGEvalConfig:
    """Configuration for RAG evaluation pipeline."""
    eval_name: str
    rag_pipeline_version: str
    embedding_model: str
    llm_model: str
    chunk_size: int
    chunk_overlap: int
    top_k: int
    eval_dataset_path: str
    metrics: list[str]
    thresholds: dict[str, float]  # metric -> minimum acceptable score

class RAGEvalPipeline:
    """End-to-end RAG evaluation pipeline."""

    def __init__(self, config: RAGEvalConfig):
        self.config = config
        self.results = {}

    def load_eval_dataset(self) -> Dataset:
        """Load and validate the evaluation dataset."""
        with open(self.config.eval_dataset_path) as f:
            data = json.load(f)

        required_keys = {"question", "answer", "contexts", "ground_truth"}
        assert required_keys.issubset(data.keys()), \
            f"Dataset missing keys: {required_keys - data.keys()}"

        return Dataset.from_dict(data)

    def run_evaluation(self) -> dict:
        """Run the complete evaluation pipeline."""
        dataset = self.load_eval_dataset()

        # Map metric names to RAGAS metric objects
        metric_map = {
            "faithfulness": faithfulness,
            "answer_relevancy": answer_relevancy,
            "context_precision": context_precision,
            "context_recall": context_recall,
            "answer_correctness": answer_correctness,
        }

        metrics = [metric_map[m] for m in self.config.metrics if m in metric_map]

        results = evaluate(dataset=dataset, metrics=metrics)
        self.results = dict(results)
        return self.results

    def check_thresholds(self) -> dict:
        """Check if evaluation results meet minimum thresholds."""
        failures = {}
        for metric, threshold in self.config.thresholds.items():
            if metric in self.results:
                actual = self.results[metric]
                if actual < threshold:
                    failures[metric] = {
                        "expected": threshold,
                        "actual": actual,
                        "gap": round(threshold - actual, 4)
                    }
        return failures

    def generate_report(self) -> dict:
        """Generate a comprehensive evaluation report."""
        failures = self.check_thresholds()
        return {
            "eval_name": self.config.eval_name,
            "timestamp": datetime.now().isoformat(),
            "config": {
                "pipeline_version": self.config.rag_pipeline_version,
                "embedding_model": self.config.embedding_model,
                "llm_model": self.config.llm_model,
                "chunk_size": self.config.chunk_size,
                "top_k": self.config.top_k
            },
            "scores": self.results,
            "thresholds": self.config.thresholds,
            "failures": failures,
            "passed": len(failures) == 0,
            "summary": "ALL CHECKS PASSED" if not failures
                       else f"FAILED: {list(failures.keys())}"
        }

# Usage
config = RAGEvalConfig(
    eval_name="hr-chatbot-v2.1-eval",
    rag_pipeline_version="2.1.0",
    embedding_model="text-embedding-3-small",
    llm_model="gpt-4o-mini",
    chunk_size=512,
    chunk_overlap=50,
    top_k=5,
    eval_dataset_path="eval_data/hr_chatbot_eval.json",
    metrics=["faithfulness", "answer_relevancy", "context_precision", "answer_correctness"],
    thresholds={
        "faithfulness": 0.85,
        "answer_relevancy": 0.80,
        "context_precision": 0.75,
        "answer_correctness": 0.80
    }
)
Key Insight: The most common RAG failure mode is not the LLM generating bad answers — it is the retriever fetching irrelevant documents. Always evaluate retrieval quality (context precision, context recall) independently from generation quality (faithfulness, answer relevance). If your context precision is low, no amount of prompt engineering will fix your answers.
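Retrieval quality can be checked with simple ID-level bookkeeping before reaching for RAGAS: if each eval sample carries gold labels for which chunks are relevant, retrieval precision and recall fall out directly. A minimal sketch (the chunk IDs and relevance labels below are hypothetical):

```python
# Scoring retrieval independently of generation, using labeled chunk IDs.

def retrieval_scores(retrieved_ids: list[str], relevant_ids: set[str]) -> dict:
    """Precision and recall over one query's retrieved chunk IDs."""
    hits = [doc_id for doc_id in retrieved_ids if doc_id in relevant_ids]
    precision = len(hits) / len(retrieved_ids) if retrieved_ids else 0.0
    recall = len(hits) / len(relevant_ids) if relevant_ids else 0.0
    return {"precision": round(precision, 2), "recall": round(recall, 2)}

# One query: top-4 retrieved chunks vs. two labeled-relevant chunks
scores = retrieval_scores(
    retrieved_ids=["pto-policy-3", "dress-code-1", "pto-policy-4", "expenses-2"],
    relevant_ids={"pto-policy-3", "pto-policy-4"},
)
print(scores)  # {'precision': 0.5, 'recall': 1.0}
```

If recall is low here, the fix lives in chunking or embedding choices, not in the prompt.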

4. Agent Evaluation

Evaluating autonomous agents adds layers of complexity beyond single-call LLM evaluation. Agents make sequences of decisions, use tools, and their final output depends on the entire trajectory of actions — not just the last step. This section covers the unique challenges of agent evaluation, practical strategies for assessing both outcomes and reasoning paths, and frameworks for measuring agent reliability across diverse scenarios.

4.1 Challenges of Evaluating Agents

Evaluating agents is fundamentally harder than evaluating simple LLM calls because agents make sequences of decisions with branching paths. The same task can be completed via different tool sequences, and the quality of intermediate steps matters as much as the final answer.
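The compounding effect of intermediate steps is easy to quantify: if each step succeeds with probability p, an n-step trajectory succeeds with roughly p**n, so even highly reliable individual steps erode quickly:

```python
# Per-step reliability compounds multiplicatively across a trajectory:
# an n-step run where each step succeeds with probability p succeeds
# with roughly p**n overall.
for p in (0.99, 0.95, 0.90):
    summary = ", ".join(f"{n} steps: {p ** n:.0%}" for n in (5, 10, 20))
    print(f"step reliability {p:.0%} -> {summary}")
# 95%-reliable steps complete a 10-step trajectory only ~60% of the time
```

This is why per-step evaluation matters: a 40% task failure rate can hide ten individually "good-looking" steps.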

| Challenge | Description | Mitigation |
|---|---|---|
| Non-determinism | Same task, different tool call sequences each run | Evaluate outcomes, not paths; run N times |
| Multi-step | Errors compound — one bad step derails everything | Evaluate each step + final outcome independently |
| Tool interactions | Correct tool selection is as important as correct arguments | Track tool selection accuracy as a separate metric |
| Cost explosion | Agents that loop excessively burn tokens | Track token/step counts, set hard limits |
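The "run N times" mitigation is cheap to implement: wrap the agent call, repeat it on the same task, and report a pass rate instead of trusting a single run. A sketch, where run_agent_task is a hypothetical stand-in for a real agent invocation:

```python
import random

def run_agent_task(task: str) -> bool:
    """Hypothetical stand-in: a flaky agent that succeeds ~70% of the time."""
    return random.random() < 0.7

def pass_rate(task: str, n_runs: int = 10) -> float:
    """Fraction of n_runs in which the agent completed the task."""
    successes = sum(run_agent_task(task) for _ in range(n_runs))
    return successes / n_runs

random.seed(0)  # seeded only to make this demo reproducible
rate = pass_rate("file the Q3 expense report", n_runs=20)
print(f"pass rate over 20 runs: {rate:.0%}")
```

Reporting pass rate (rather than a single pass/fail) also gives you a stable number to regress against in CI.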

4.2 Agent Eval Strategies

Agent evaluation requires assessing both the final outcome (did the agent produce the correct result?) and the trajectory (did it take a reasonable path to get there?). The framework below captures agent trajectories as structured data — tracking which tools were called, in what order, and with what arguments — then scores them on task completion, tool selection (precision and recall against an expected tool set), and efficiency (step and token budgets).

# Agent evaluation framework
from dataclasses import dataclass, field
from typing import Any, Optional

@dataclass
class AgentTrajectory:
    """Records the full trajectory of an agent execution."""
    task: str
    steps: list[dict] = field(default_factory=list)
    final_answer: Optional[str] = None
    total_tokens: int = 0
    total_cost: float = 0.0
    total_steps: int = 0
    success: bool = False

    def add_step(self, step_type: str, content: str,
                 tool_name: Optional[str] = None, tool_args: Optional[dict] = None,
                 tool_result: Optional[str] = None, tokens: int = 0):
        self.steps.append({
            "step_num": len(self.steps) + 1,
            "type": step_type,
            "content": content,
            "tool_name": tool_name,
            "tool_args": tool_args,
            "tool_result": tool_result,
            "tokens": tokens
        })
        self.total_tokens += tokens
        self.total_steps = len(self.steps)

class AgentEvaluator:
    """Evaluates agent trajectories across multiple dimensions."""

    def __init__(self, judge_llm=None):
        self.judge = judge_llm

    def evaluate_task_completion(
        self, trajectory: AgentTrajectory, expected_answer: str
    ) -> dict:
        """Did the agent complete the task correctly?"""
        if not trajectory.final_answer:
            return {"score": 0, "reason": "No final answer produced"}

        # Use LLM-as-judge for semantic comparison
        # (simplified — would use the LLMJudge class from Section 2)
        return {
            "score": 1 if trajectory.success else 0,
            "final_answer": trajectory.final_answer,
            "expected": expected_answer
        }

    def evaluate_tool_selection(
        self, trajectory: AgentTrajectory,
        expected_tools: list[str]
    ) -> dict:
        """Did the agent use the right tools?"""
        used_tools = [
            s["tool_name"] for s in trajectory.steps
            if s["tool_name"] is not None
        ]

        expected_set = set(expected_tools)
        used_set = set(used_tools)

        correct = expected_set & used_set
        missed = expected_set - used_set
        unnecessary = used_set - expected_set

        precision = len(correct) / len(used_set) if used_set else 0
        recall = len(correct) / len(expected_set) if expected_set else 0

        return {
            "precision": round(precision, 2),
            "recall": round(recall, 2),
            "correct_tools": list(correct),
            "missed_tools": list(missed),
            "unnecessary_tools": list(unnecessary)
        }

    def evaluate_efficiency(
        self, trajectory: AgentTrajectory,
        max_steps: int = 10, max_tokens: int = 50000
    ) -> dict:
        """Was the agent efficient in completing the task?"""
        step_ratio = trajectory.total_steps / max_steps
        token_ratio = trajectory.total_tokens / max_tokens

        return {
            "total_steps": trajectory.total_steps,
            "max_steps": max_steps,
            "step_efficiency": round(1 - min(step_ratio, 1), 2),
            "total_tokens": trajectory.total_tokens,
            "token_efficiency": round(1 - min(token_ratio, 1), 2),
            "estimated_cost": round(trajectory.total_cost, 4)
        }

    def full_evaluation(
        self, trajectory: AgentTrajectory,
        expected_answer: str, expected_tools: list[str]
    ) -> dict:
        """Complete agent evaluation."""
        return {
            "task": trajectory.task,
            "completion": self.evaluate_task_completion(trajectory, expected_answer),
            "tool_selection": self.evaluate_tool_selection(trajectory, expected_tools),
            "efficiency": self.evaluate_efficiency(trajectory),
            "trajectory_length": trajectory.total_steps
        }

5. Observability & Tracing

Observability in LLM applications means being able to see every LLM call, every retrieval, every tool invocation, and every token spent across your entire pipeline. Without it, debugging production issues is like reading tea leaves.

5.1 LangSmith

LangSmith is the observability platform built by LangChain. It provides end-to-end tracing, evaluation, dataset management, and prompt playground capabilities.

# LangSmith — Setting up tracing for your LangChain app
# pip install langsmith langchain-openai

import os

# Enable LangSmith tracing (set these environment variables)
# export LANGCHAIN_API_KEY="ls_..."  (get from smith.langchain.com)
# export OPENAI_API_KEY="sk-..."
os.environ["LANGCHAIN_TRACING_V2"] = "true"
os.environ["LANGCHAIN_API_KEY"] = os.getenv("LANGCHAIN_API_KEY", "ls_your_api_key_here")
os.environ["LANGCHAIN_PROJECT"] = "hr-chatbot-production"

# That's it! All LangChain operations are now traced automatically.
# Every chain invocation, LLM call, retriever query, and tool call
# appears in the LangSmith dashboard with full input/output logging.

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
prompt = ChatPromptTemplate.from_messages([
    ("system", "You are a helpful HR assistant."),
    ("human", "{question}")
])

chain = prompt | llm

# This call is automatically traced in LangSmith
response = chain.invoke({"question": "What is our PTO policy?"})

# --- LangSmith Evaluation with Datasets ---
from langsmith import Client
from langsmith.evaluation import evaluate

client = Client()

# Create an evaluation dataset
dataset = client.create_dataset("hr-chatbot-eval-v1")

# Add examples to the dataset
client.create_examples(
    inputs=[
        {"question": "How many vacation days do I get?"},
        {"question": "What is the dress code?"},
        {"question": "How do I submit expenses?"},
    ],
    outputs=[
        {"answer": "20 days of annual PTO"},
        {"answer": "Business casual Monday-Thursday, casual Friday"},
        {"answer": "Submit through Expensify within 30 days"},
    ],
    dataset_id=dataset.id
)

# Define your evaluator
def correctness_evaluator(run, example):
    """Custom evaluator that checks answer correctness."""
    prediction = run.outputs.get("output", "")
    reference = example.outputs.get("answer", "")

    # Simple check — in production, use LLM-as-judge
    score = 1.0 if reference.lower() in prediction.lower() else 0.0
    return {"key": "correctness", "score": score}

# Run evaluation
results = evaluate(
    lambda inputs: chain.invoke(inputs),
    data="hr-chatbot-eval-v1",
    evaluators=[correctness_evaluator],
    experiment_prefix="hr-chatbot-gpt4o-mini"
)

5.2 Langfuse

Langfuse is an open-source LLM observability platform that provides tracing, analytics, evaluation, and prompt management. It is the leading open-source alternative to LangSmith.

# Langfuse — Open-source LLM observability
# pip install langfuse openai

import os
from langfuse import Langfuse
from langfuse.decorators import observe, langfuse_context

# Set API keys via environment variables
# export LANGFUSE_PUBLIC_KEY="pk-..."
# export LANGFUSE_SECRET_KEY="sk-..."
# export OPENAI_API_KEY="sk-..."  (used by generate_answer below)

# Initialize Langfuse client
langfuse = Langfuse(
    public_key=os.getenv("LANGFUSE_PUBLIC_KEY", "pk-your-key"),
    secret_key=os.getenv("LANGFUSE_SECRET_KEY", "sk-your-key"),
    host="https://cloud.langfuse.com"  # or self-hosted URL
)

# --- Decorator-based tracing ---
@observe()
def rag_pipeline(question: str) -> str:
    """Full RAG pipeline with automatic Langfuse tracing."""

    # Step 1: Retrieve documents
    docs = retrieve_documents(question)

    # Step 2: Generate answer
    answer = generate_answer(question, docs)

    # Add metadata and scores
    langfuse_context.update_current_observation(
        metadata={"retriever": "chroma", "top_k": 5},
    )
    langfuse_context.score_current_trace(
        name="user_feedback",
        value=1,
        comment="User found the answer helpful"
    )

    return answer

@observe()
def retrieve_documents(question: str) -> list:
    """Retrieve relevant documents — traced as a span."""
    # Your retrieval logic here
    langfuse_context.update_current_observation(
        metadata={"num_results": 5, "similarity_threshold": 0.7}
    )
    return ["doc1 content", "doc2 content"]

@observe(as_type="generation")
def generate_answer(question: str, docs: list) -> str:
    """Generate answer — traced as an LLM generation."""
    from openai import OpenAI
    client = OpenAI()

    context = "\n".join(docs)
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": f"Context: {context}"},
            {"role": "user", "content": question}
        ]
    )

    result = response.choices[0].message.content

    # Log token usage and cost
    langfuse_context.update_current_observation(
        usage={
            "input": response.usage.prompt_tokens,
            "output": response.usage.completion_tokens,
            "total": response.usage.total_tokens
        },
        model="gpt-4o-mini"
    )

    return result

# --- Langfuse Prompt Management ---
# Store and version prompts in Langfuse
langfuse.create_prompt(
    name="hr-assistant-system",
    prompt="You are a helpful HR assistant for {{company_name}}. "
           "Answer based only on the provided context. "
           "If unsure, say 'I don't have that information.'",
    config={"model": "gpt-4o-mini", "temperature": 0.1},
    labels=["production"]
)

# Fetch the latest production prompt
prompt = langfuse.get_prompt("hr-assistant-system", label="production")
compiled = prompt.compile(company_name="Acme Corp")

5.3 Arize Phoenix

Arize Phoenix is an open-source observability tool focused on LLM trace visualization, embedding analysis, and evaluation. It excels at providing a visual, notebook-friendly experience for debugging LLM pipelines.

# Arize Phoenix — Visual LLM observability
# pip install arize-phoenix openinference-instrumentation-openai openai

import os
import phoenix as px
from openinference.instrumentation.openai import OpenAIInstrumentor
from phoenix.otel import register

# Requires OPENAI_API_KEY environment variable
# export OPENAI_API_KEY="sk-..."

# Launch Phoenix UI (local dashboard)
px.launch_app()

# Set up OpenTelemetry tracing
tracer_provider = register(project_name="hr-chatbot")

# Instrument OpenAI calls — all API calls are now traced
OpenAIInstrumentor().instrument(tracer_provider=tracer_provider)

# Now every OpenAI call is visible in the Phoenix dashboard
from openai import OpenAI
client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Explain RAGAS in 3 sentences."}
    ]
)

# View traces at http://localhost:6006
# Phoenix shows: latency, token counts, cost, full I/O
| Feature | LangSmith | Langfuse | Arize Phoenix |
|---|---|---|---|
| Hosting | Cloud (managed) | Cloud + self-hosted | Local + cloud |
| Open Source | No (proprietary) | Yes (MIT) | Yes (Apache 2.0) |
| LangChain Integration | Native (first-party) | Official callback handler | OpenTelemetry-based |
| Prompt Management | Yes (hub + playground) | Yes (versioned prompts) | No |
| Evaluation | Built-in + custom | Built-in + custom | Built-in + RAGAS integration |
| Best For | LangChain-heavy teams | Open-source, self-hosted needs | Notebook-based debugging |

6. Experiment Tracking

LLM development is inherently experimental — you’re constantly testing different prompts, models, temperatures, and retrieval configurations. Without systematic experiment tracking, it’s impossible to know which changes actually improved performance. This section covers how to design rigorous LLM experiments with proper controls, and how to run A/B tests that produce statistically valid results for comparing model configurations in production.

6.1 Designing LLM Experiments

Every change to a prompt, retrieval strategy, model, or hyperparameter could improve or degrade performance. Without structured experiment tracking, you lose the ability to reproduce results and make data-driven decisions.

# LLM Experiment Tracking Framework
from dataclasses import dataclass, field
from datetime import datetime
from typing import Any, Optional
import json
import hashlib

@dataclass
class LLMExperiment:
    """Track a single LLM experiment with all its parameters."""
    name: str
    description: str

    # Model config
    model: str
    temperature: float
    max_tokens: int

    # Prompt config
    system_prompt: str
    prompt_template: str

    # RAG config (optional)
    embedding_model: Optional[str] = None
    chunk_size: Optional[int] = None
    top_k: Optional[int] = None

    # Results
    metrics: dict = field(default_factory=dict)
    cost: float = 0.0
    latency_p50: float = 0.0
    latency_p95: float = 0.0

    # Metadata
    timestamp: str = field(default_factory=lambda: datetime.now().isoformat())
    tags: list[str] = field(default_factory=list)

    @property
    def config_hash(self) -> str:
        """Generate a unique hash for this experiment's config."""
        config = {
            "model": self.model,
            "temperature": self.temperature,
            "system_prompt": self.system_prompt,
            "prompt_template": self.prompt_template,
            "embedding_model": self.embedding_model,
            "chunk_size": self.chunk_size,
            "top_k": self.top_k
        }
        return hashlib.md5(json.dumps(config, sort_keys=True).encode()).hexdigest()[:8]

class ExperimentTracker:
    """Manages experiment history and comparisons."""

    def __init__(self, project_name: str):
        self.project_name = project_name
        self.experiments: list[LLMExperiment] = []

    def log_experiment(self, experiment: LLMExperiment):
        """Log a completed experiment."""
        self.experiments.append(experiment)
        print(f"Logged experiment: {experiment.name} [{experiment.config_hash}]")

    def compare(self, exp_names: list[str], metric: str) -> dict:
        """Compare experiments by a specific metric."""
        results = {}
        for exp in self.experiments:
            if exp.name in exp_names and metric in exp.metrics:
                results[exp.name] = {
                    "value": exp.metrics[metric],
                    "model": exp.model,
                    "config_hash": exp.config_hash
                }
        return dict(sorted(results.items(), key=lambda x: x[1]["value"], reverse=True))

    def best_experiment(self, metric: str) -> Optional[LLMExperiment]:
        """Find the best experiment for a given metric."""
        valid = [e for e in self.experiments if metric in e.metrics]
        if not valid:
            return None
        return max(valid, key=lambda e: e.metrics[metric])

# Usage
tracker = ExperimentTracker("hr-chatbot")

# Experiment 1: GPT-4o-mini with basic prompt
exp1 = LLMExperiment(
    name="baseline-gpt4o-mini",
    description="Baseline with GPT-4o-mini and simple prompt",
    model="gpt-4o-mini",
    temperature=0.1,
    max_tokens=500,
    system_prompt="You are a helpful HR assistant.",
    prompt_template="Context: {context}\nQuestion: {question}\nAnswer:",
    embedding_model="text-embedding-3-small",
    chunk_size=512,
    top_k=5,
    metrics={"faithfulness": 0.82, "relevance": 0.88, "correctness": 0.79},
    cost=0.12,
    latency_p50=1.2
)
tracker.log_experiment(exp1)

# Experiment 2: Same model, better prompt
exp2 = LLMExperiment(
    name="improved-prompt-v2",
    description="Structured prompt with explicit grounding instructions",
    model="gpt-4o-mini",
    temperature=0.0,
    max_tokens=500,
    system_prompt="You are an HR assistant. ONLY use the provided context. If the context does not contain the answer, say 'I don't have that information.'",
    prompt_template="Context:\n{context}\n\nEmployee Question: {question}\n\nProvide a clear, accurate answer based ONLY on the context above:",
    embedding_model="text-embedding-3-small",
    chunk_size=512,
    top_k=5,
    metrics={"faithfulness": 0.94, "relevance": 0.91, "correctness": 0.87},
    cost=0.13,
    latency_p50=1.3
)
tracker.log_experiment(exp2)

# Compare
print(tracker.compare(["baseline-gpt4o-mini", "improved-prompt-v2"], "faithfulness"))
# => improved-prompt-v2: 0.94, baseline-gpt4o-mini: 0.82

6.2 A/B Testing for LLMs

A/B testing LLM configurations requires statistical rigor because LLM outputs vary between runs. The framework below implements the operational core: deterministic variant assignment via consistent hashing (so each user always sees the same variant), per-request metric recording, and aggregated per-variant summaries. Combine its output with sufficient sample sizes and a statistical significance test, such as the Mann-Whitney U test for non-normal score distributions, so you can distinguish genuine improvements from random noise.

# A/B Testing framework for LLM applications
import hashlib
from collections import defaultdict
from typing import Callable

class LLMABTest:
    """A/B test framework for comparing LLM configurations."""

    def __init__(self, test_name: str, variants: dict[str, Callable]):
        """
        Args:
            test_name: Identifier for this A/B test
            variants: dict mapping variant name -> callable that produces response
        """
        self.test_name = test_name
        self.variants = variants
        self.assignments = {}  # user_id -> variant
        self.metrics = defaultdict(lambda: defaultdict(list))

    def assign_variant(self, user_id: str) -> str:
        """Deterministically assign a user to a variant."""
        if user_id not in self.assignments:
            # Stable hashing: Python's built-in hash() is salted per process
            # for strings, so it would reassign users across restarts
            digest = hashlib.md5(f"{self.test_name}:{user_id}".encode()).hexdigest()
            idx = int(digest, 16) % len(self.variants)
            self.assignments[user_id] = list(self.variants.keys())[idx]
        return self.assignments[user_id]

    def get_response(self, user_id: str, **kwargs) -> tuple[str, str]:
        """Get response for a user (returns variant_name, response)."""
        variant = self.assign_variant(user_id)
        response = self.variants[variant](**kwargs)
        return variant, response

    def record_metric(self, user_id: str, metric_name: str, value: float):
        """Record a metric for a user's assigned variant."""
        variant = self.assignments.get(user_id)
        if variant:
            self.metrics[variant][metric_name].append(value)

    def get_results(self) -> dict:
        """Get aggregated A/B test results."""
        results = {}
        for variant, metrics in self.metrics.items():
            results[variant] = {}
            for metric_name, values in metrics.items():
                results[variant][metric_name] = {
                    "mean": round(sum(values) / len(values), 4),
                    "count": len(values),
                    "min": min(values),
                    "max": max(values)
                }
        return results
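get_results reports per-variant means, but a mean gap alone cannot separate a real improvement from noise. Before declaring a winner, run a significance test on the per-request metric lists: scipy.stats.mannwhitneyu is a common choice for non-normal score distributions, and a stdlib-only permutation test works as well. A sketch with hypothetical per-request correctness scores:

```python
import random

def permutation_p_value(a: list[float], b: list[float],
                        n_perm: int = 5000, seed: int = 0) -> float:
    """One-sided p-value for 'variant b's mean exceeds variant a's'."""
    rng = random.Random(seed)
    observed = sum(b) / len(b) - sum(a) / len(a)
    pooled = list(a) + list(b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)  # random relabeling under the null hypothesis
        perm_a, perm_b = pooled[:len(a)], pooled[len(a):]
        if sum(perm_b) / len(perm_b) - sum(perm_a) / len(perm_a) >= observed:
            hits += 1
    return hits / n_perm

# Hypothetical per-request correctness scores from two prompt variants
baseline = [0.70, 0.65, 0.72, 0.68, 0.71, 0.66, 0.69, 0.73, 0.67, 0.70]
improved = [0.80, 0.78, 0.83, 0.79, 0.81, 0.77, 0.82, 0.80, 0.78, 0.81]
p = permutation_p_value(baseline, improved)
print(f"p-value: {p:.4f}")  # a small p means the gap is unlikely to be noise
```

With a handful of samples per variant, even a 10-point gap can fail this test; that is the signal to keep collecting data rather than ship the change.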

7. CI/CD for LLM Applications

Traditional CI/CD pipelines test for correctness — does the code produce the expected output? LLM CI/CD must also test for quality: are responses accurate, safe, and consistent with previous behavior? This section covers how to design LLM-aware pipelines that gate deployments on evaluation scores, enforce prompt version control, and run regression tests that catch quality degradations before they reach production.

7.1 LLM CI/CD Pipeline Design

Beyond verifying that code compiles and unit tests pass, an LLM pipeline must verify that prompt changes do not degrade output quality, retrieval changes do not break existing answers, and model upgrades maintain expected behavior.

# GitHub Actions CI/CD pipeline for an LLM application
# .github/workflows/llm-ci.yml

name: LLM Application CI/CD

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
    paths:
      - 'prompts/**'
      - 'src/chains/**'
      - 'src/agents/**'
      - 'config/models.yaml'

jobs:
  # Stage 1: Fast checks (seconds)
  lint-and-unit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install -r requirements.txt
      - run: ruff check .
      - run: pytest tests/unit/ -v

  # Stage 2: Prompt regression tests (minutes)
  prompt-eval:
    runs-on: ubuntu-latest
    needs: lint-and-unit
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install -r requirements.txt
      - name: Run prompt regression suite
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
          LANGCHAIN_API_KEY: ${{ secrets.LANGCHAIN_API_KEY }}
        run: |
          python -m eval.run_prompt_eval \
            --dataset eval_data/golden_set.json \
            --config config/eval_config.yaml \
            --threshold-file config/thresholds.yaml \
            --output results/prompt_eval.json
      - name: Check thresholds
        run: python -m eval.check_thresholds results/prompt_eval.json
      - uses: actions/upload-artifact@v4
        with:
          name: prompt-eval-results
          path: results/

  # Stage 3: RAG pipeline evaluation (minutes)
  rag-eval:
    runs-on: ubuntu-latest
    needs: lint-and-unit
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: '3.11'
      - run: pip install -r requirements.txt
      - name: Run RAGAS evaluation
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
        run: |
          python -m eval.run_ragas_eval \
            --dataset eval_data/rag_golden_set.json \
            --metrics faithfulness,answer_relevancy,context_precision \
            --thresholds '{"faithfulness": 0.85, "answer_relevancy": 0.80}'
      - name: Post results to PR
        if: github.event_name == 'pull_request'
        uses: actions/github-script@v7
        with:
          script: |
            const fs = require('fs');
            const results = JSON.parse(fs.readFileSync('results/ragas_eval.json'));
            const body = `## RAG Eval Results\n| Metric | Score | Threshold | Status |\n|--------|-------|-----------|--------|\n${
              Object.entries(results.scores).map(([k, v]) =>
                `| ${k} | ${v.toFixed(3)} | ${results.thresholds[k]} | ${v >= results.thresholds[k] ? 'PASS' : 'FAIL'} |`
              ).join('\n')
            }`;
            github.rest.issues.createComment({
              issue_number: context.issue.number,
              owner: context.repo.owner,
              repo: context.repo.repo,
              body
            });

  # Stage 4: Deploy (only on main)
  deploy:
    runs-on: ubuntu-latest
    needs: [prompt-eval, rag-eval]
    if: github.ref == 'refs/heads/main'
    steps:
      - uses: actions/checkout@v4
      - name: Deploy to production
        run: echo "Deploying LLM app..."
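The pipeline gates deployment on python -m eval.check_thresholds, which is project-specific code rather than a library. A hedged sketch of what such a gate might look like; the results-file layout (top-level "scores" and "thresholds" keys) is an assumption mirroring the report format used earlier:

```python
# eval/check_thresholds.py (sketch): exit non-zero when any metric misses
# its threshold so the CI job fails. Assumed file layout:
# {"scores": {"faithfulness": 0.91, ...}, "thresholds": {"faithfulness": 0.85, ...}}
import json
import sys

def check_thresholds(path: str) -> int:
    with open(path) as f:
        report = json.load(f)
    exit_code = 0
    for metric, score in report["scores"].items():
        threshold = report["thresholds"].get(metric)
        if threshold is not None and score < threshold:
            print(f"FAIL {metric}: {score:.3f} < {threshold}")
            exit_code = 1
        else:
            print(f"PASS {metric}: {score:.3f}")
    return exit_code

# entry point (commented so the sketch is importable without arguments):
# if __name__ == "__main__":
#     sys.exit(check_thresholds(sys.argv[1]))
```

Returning a non-zero exit code is what makes the GitHub Actions step, and therefore the deploy job that needs it, fail.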

7.2 Regression Testing Prompts

Prompt regression testing catches quality degradations when you update prompts, switch models, or change retrieval configurations. The suite below maintains a library of test cases with expected outputs, runs the current prompt against each case, and scores every response with an LLM-as-judge against a minimum correctness threshold. This is critical for CI/CD pipelines where every prompt change must be validated against a known-good baseline before deployment.

# Prompt regression testing framework
import json
from pathlib import Path
from typing import Optional

class PromptRegressionSuite:
    """Regression test suite for prompt changes."""

    def __init__(self, golden_set_path: str, thresholds: dict):
        self.golden_set = self._load_golden_set(golden_set_path)
        self.thresholds = thresholds
        self.results = []

    def _load_golden_set(self, path: str) -> list[dict]:
        """Load golden test set with expected behaviors."""
        with open(path) as f:
            return json.load(f)

    def run_test(self, chain_fn, judge) -> dict:
        """Run the full regression suite."""
        passed = 0
        failed = 0
        failures = []

        for i, sample in enumerate(self.golden_set):
            # Generate response
            response = chain_fn(sample["input"])

            # Evaluate with LLM-as-judge
            score = judge.evaluate_correctness(
                question=sample["input"]["question"],
                response=response,
                reference=sample["expected_output"]
            )

            # Check threshold
            if score.score >= self.thresholds.get("correctness_min", 4):
                passed += 1
            else:
                failed += 1
                failures.append({
                    "sample_id": i,
                    "question": sample["input"]["question"],
                    "expected": sample["expected_output"],
                    "actual": response,
                    "score": score.score,
                    "reasoning": score.reasoning
                })

        total = len(self.golden_set)
        pass_rate = passed / total if total > 0 else 0

        result = {
            "total": total,
            "passed": passed,
            "failed": failed,
            "pass_rate": round(pass_rate, 4),
            "threshold": self.thresholds.get("pass_rate_min", 0.90),
            "suite_passed": pass_rate >= self.thresholds.get("pass_rate_min", 0.90),
            "failures": failures
        }

        return result

# Golden set format (eval_data/golden_set.json)
GOLDEN_SET_EXAMPLE = [
    {
        "input": {"question": "How many vacation days do new employees get?"},
        "expected_output": "New employees receive 15 days of PTO in their first year.",
        "tags": ["hr-policy", "pto", "critical"],
        "difficulty": "easy"
    },
    {
        "input": {"question": "What is the process for requesting a leave of absence?"},
        "expected_output": "Submit a formal request to HR at least 30 days in advance for planned leaves. Include dates, reason, and manager approval.",
        "tags": ["hr-policy", "leave", "critical"],
        "difficulty": "medium"
    }
]
Key Insight: Your golden test set is your most valuable evaluation asset. Curate it carefully with real user questions, edge cases, and previously failed examples. Every production bug should become a new golden test case. Over time, this dataset becomes an invaluable regression safety net that prevents you from re-introducing fixed issues.
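Turning production bugs into test cases can be a one-function helper that appends the failed interaction to the golden set file, using the same JSON shape as GOLDEN_SET_EXAMPLE above. The file path and "from-production-bug" tagging convention are assumptions:

```python
import json
from pathlib import Path

def add_golden_case(path: str, question: str, expected: str,
                    tags: list[str], difficulty: str = "medium") -> int:
    """Append a regression case to the golden set; returns the new set size."""
    p = Path(path)
    cases = json.loads(p.read_text()) if p.exists() else []
    cases.append({
        "input": {"question": question},
        "expected_output": expected,
        "tags": tags + ["from-production-bug"],  # assumed tagging convention
        "difficulty": difficulty,
    })
    p.write_text(json.dumps(cases, indent=2))
    return len(cases)

# After triaging a bug where the bot misquoted the expense window:
# add_golden_case("eval_data/golden_set.json",
#                 question="What is the deadline for expense submissions?",
#                 expected="Expenses must be submitted through Expensify within 30 days.",
#                 tags=["hr-policy", "expenses"])
```

Wiring this into your incident workflow means the regression suite grows exactly where the system has proven weakest.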

8. Cost Tracking & Optimization

LLM API costs can escalate quickly in production — a single poorly-optimized agent workflow can consume thousands of dollars in tokens per day. Systematic cost tracking is essential for budget control, pricing decisions, and identifying optimization opportunities. This section covers token economics across providers, and demonstrates how to build cost monitoring dashboards that give you real-time visibility into spend by model, endpoint, and user.

8.1 Token Economics

Understanding token costs is critical for building sustainable AI applications. A single poorly optimized prompt can cost 10x more than a well-designed one for the same quality output.

| Model | Input Cost (per 1M tokens) | Output Cost (per 1M tokens) | Speed | Best For |
|---|---|---|---|---|
| GPT-4o | $2.50 | $10.00 | Fast | Complex reasoning, evaluation |
| GPT-4o-mini | $0.15 | $0.60 | Very fast | Most production tasks |
| Claude 3.5 Sonnet | $3.00 | $15.00 | Fast | Long-form, nuanced tasks |
| Claude 3.5 Haiku | $0.25 | $1.25 | Very fast | Classification, extraction |
| Llama 3 70B (self-hosted) | ~$0.50 (compute) | ~$0.50 (compute) | Depends on hardware | Cost-sensitive, data privacy |
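Reading the pricing for a single call makes the economics concrete. For example, one gpt-4o-mini request with 1,200 input tokens and 300 output tokens costs:

```python
# Cost of one gpt-4o-mini call at $0.15 input / $0.60 output per 1M tokens
input_cost = 1_200 / 1_000_000 * 0.15   # ~ $0.00018
output_cost = 300 / 1_000_000 * 0.60    # ~ $0.00018
total = input_cost + output_cost
print(f"${total:.6f} per call")                    # $0.000360 per call
print(f"${total * 100_000:,.2f} per 100k calls")   # $36.00 per 100k calls
```

The same call at GPT-4o pricing runs about $0.006, roughly 17x more, which is why defaulting to the cheapest model that meets your quality bar matters at scale.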

8.2 Cost Monitoring Dashboards

A cost monitoring dashboard gives you real-time visibility into LLM API spend across models, endpoints, and time periods. The implementation below tracks every API call’s token usage and cost, provides daily totals plus breakdowns by model and endpoint, and raises an alert when daily spend approaches a configured budget. This is the operational foundation for cost optimization — you can’t optimize what you don’t measure.

# Cost tracking and monitoring for LLM applications
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from collections import defaultdict
from typing import Optional

# Token pricing per million tokens (as of early 2026)
MODEL_PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "gpt-4-turbo": {"input": 10.00, "output": 30.00},
    "claude-3-5-sonnet": {"input": 3.00, "output": 15.00},
    "claude-3-5-haiku": {"input": 0.25, "output": 1.25},
    "text-embedding-3-small": {"input": 0.02, "output": 0.0},
    "text-embedding-3-large": {"input": 0.13, "output": 0.0},
}

@dataclass
class TokenUsage:
    """Track token usage for a single LLM call."""
    model: str
    input_tokens: int
    output_tokens: int
    timestamp: str = field(default_factory=lambda: datetime.now().isoformat())
    endpoint: str = ""
    user_id: str = ""

    @property
    def cost(self) -> float:
        pricing = MODEL_PRICING.get(self.model, {"input": 0, "output": 0})
        input_cost = (self.input_tokens / 1_000_000) * pricing["input"]
        output_cost = (self.output_tokens / 1_000_000) * pricing["output"]
        return round(input_cost + output_cost, 6)

class CostTracker:
    """Production cost tracking and alerting."""

    def __init__(self, daily_budget: float = 100.0, alert_threshold: float = 0.8):
        self.daily_budget = daily_budget
        self.alert_threshold = alert_threshold
        self.usage_log: list[TokenUsage] = []

    def log_usage(self, usage: TokenUsage):
        """Log a token usage event."""
        self.usage_log.append(usage)

        # Check if we're approaching the budget
        today_cost = self.get_daily_cost()
        if today_cost >= self.daily_budget * self.alert_threshold:
            self._send_alert(today_cost)

    def get_daily_cost(self, date: Optional[str] = None) -> float:
        """Get total cost for a specific day."""
        target = date or datetime.now().strftime("%Y-%m-%d")
        day_usage = [u for u in self.usage_log if u.timestamp.startswith(target)]
        return sum(u.cost for u in day_usage)

    def get_cost_by_model(self) -> dict:
        """Breakdown cost by model."""
        costs = defaultdict(float)
        tokens = defaultdict(lambda: {"input": 0, "output": 0})
        for u in self.usage_log:
            costs[u.model] += u.cost
            tokens[u.model]["input"] += u.input_tokens
            tokens[u.model]["output"] += u.output_tokens
        return {
            model: {
                "cost": round(cost, 4),
                "input_tokens": tokens[model]["input"],
                "output_tokens": tokens[model]["output"]
            }
            for model, cost in sorted(costs.items(), key=lambda x: -x[1])
        }

    def get_cost_by_endpoint(self) -> dict:
        """Breakdown cost by API endpoint/feature."""
        costs = defaultdict(float)
        counts = defaultdict(int)
        for u in self.usage_log:
            costs[u.endpoint] += u.cost
            counts[u.endpoint] += 1
        return {
            ep: {"total_cost": round(cost, 4), "requests": counts[ep],
                 "avg_cost": round(cost / counts[ep], 6)}
            for ep, cost in sorted(costs.items(), key=lambda x: -x[1])
        }

    def _send_alert(self, current_cost: float):
        """Send cost alert (integrate with Slack, PagerDuty, etc.)."""
        pct = round(current_cost / self.daily_budget * 100, 1)
        print(f"ALERT: Daily LLM spend at ${current_cost:.2f} "
              f"({pct}% of ${self.daily_budget} budget)")

    def optimization_recommendations(self) -> list[str]:
        """Generate cost optimization recommendations."""
        recs = []
        by_model = self.get_cost_by_model()

        # Check if expensive models are used for simple tasks
        if "gpt-4o" in by_model and by_model["gpt-4o"]["cost"] > 10:
            recs.append(
                "Consider routing simple queries to gpt-4o-mini "
                f"(currently spending ${by_model['gpt-4o']['cost']:.2f} on gpt-4o)"
            )

        by_endpoint = self.get_cost_by_endpoint()
        for ep, data in by_endpoint.items():
            if data["avg_cost"] > 0.05:
                recs.append(
                    f"Endpoint '{ep}' averages ${data['avg_cost']:.4f}/request — "
                    "consider caching or prompt optimization"
                )

        return recs

# Usage
tracker = CostTracker(daily_budget=50.0)

tracker.log_usage(TokenUsage(
    model="gpt-4o-mini", input_tokens=1500, output_tokens=300,
    endpoint="/api/chat", user_id="user_123"
))

print(f"Daily cost: ${tracker.get_daily_cost():.4f}")
print(f"By model: {tracker.get_cost_by_model()}")
print(f"Recommendations: {tracker.optimization_recommendations()}")
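The tracker above alerts against a fixed daily budget; a complementary check is to flag days whose spend jumps well above the recent trend. Here is a minimal sketch that could consume the tracker's per-day totals — the 7-day window and 2x multiplier are illustrative choices, not part of the tracker above:

```python
from statistics import mean

def detect_spend_anomaly(daily_costs: list[float], window: int = 7,
                         multiplier: float = 2.0) -> bool:
    """Flag the latest day if it exceeds `multiplier` x the trailing average.

    `daily_costs` is ordered oldest-to-newest; the last entry is today.
    """
    if len(daily_costs) < window + 1:
        return False  # not enough history to establish a baseline
    # Trailing window, excluding today
    baseline = mean(daily_costs[-(window + 1):-1])
    return daily_costs[-1] > baseline * multiplier

# Example: a steady ~$40/day baseline, then a $95 spike
history = [38.0, 41.0, 40.0, 39.5, 42.0, 40.5, 39.0, 95.0]
print(detect_spend_anomaly(history))  # True: today is more than 2x the 7-day average
```

A fixed budget catches gradual overspend; a trend check like this catches sudden regressions, such as a prompt change that doubles input size overnight.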
Common Mistake: The biggest cost driver in most LLM apps is not the model you choose — it is unnecessarily large prompts. System prompts with 2000 tokens sent on every request, retrieved contexts with 5000 tokens when 1000 would suffice, and verbose few-shot examples that could be compressed. Audit your input token counts before switching to a cheaper model.
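A practical way to run that audit is to measure each prompt component's share of the input before touching the model choice. The sketch below uses the rough ~4 characters-per-token approximation (for exact counts you would use a real tokenizer such as tiktoken); the component names and sizes are illustrative:

```python
# Rough input-token audit for a prompt's components.
# Uses the common ~4 chars/token approximation; swap in a real tokenizer
# (e.g. tiktoken) for exact counts.
def approx_tokens(text: str) -> int:
    return max(1, len(text) // 4)

def audit_prompt(components: dict[str, str]) -> None:
    """Print each component's approximate token share of the total input."""
    counts = {name: approx_tokens(text) for name, text in components.items()}
    total = sum(counts.values())
    for name, n in sorted(counts.items(), key=lambda x: -x[1]):
        print(f"{name:>18}: ~{n:>5} tokens ({n / total:.0%})")
    print(f"{'TOTAL':>18}: ~{total} tokens")

# Illustrative breakdown of a bloated request
audit_prompt({
    "system_prompt": "x" * 8_000,       # ~2,000 tokens sent on EVERY request
    "retrieved_context": "x" * 20_000,  # ~5,000 tokens of context
    "few_shot_examples": "x" * 4_000,   # ~1,000 tokens of examples
    "user_question": "x" * 200,         # ~50 tokens of the actual question
})
```

An audit like this usually shows the user's actual question is a tiny fraction of what you pay for; trimming the system prompt and retrieved context often saves more than any model swap.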

Exercises & Self-Assessment

Exercise 1

Build an LLM-as-Judge Evaluator

Implement a complete LLM-as-judge system that evaluates RAG responses on three dimensions: correctness, faithfulness, and relevance. Requirements:

  1. Create structured output schemas for each evaluation dimension
  2. Write clear evaluation prompts with scoring rubrics
  3. Evaluate a set of 10 sample RAG responses
  4. Compare judges: GPT-4o vs Claude 3.5 Sonnet — do they agree?
Exercise 2

RAGAS Evaluation Pipeline

Build and run a complete RAGAS evaluation pipeline:

  1. Create a golden evaluation dataset with 20 question-answer pairs
  2. Run your RAG pipeline to generate answers and capture retrieved contexts
  3. Evaluate using all RAGAS metrics (faithfulness, relevance, precision, recall)
  4. Identify the weakest dimension and propose a fix
  5. Implement the fix and re-evaluate — did the metrics improve?
Exercise 3

Set Up Observability

Choose one observability platform (LangSmith, Langfuse, or Phoenix) and:

  1. Instrument a LangChain RAG pipeline with full tracing
  2. Send 50 queries through the pipeline
  3. Analyze the traces: what is the average latency? Token usage? Cost?
  4. Find the slowest query — why was it slow? Can you optimize it?
Exercise 4

CI/CD Pipeline for Prompts

Design and implement a CI/CD pipeline that prevents prompt regressions:

  1. Create a golden test set of 15 critical Q&A pairs
  2. Write a GitHub Actions workflow that evaluates prompts on every PR
  3. Set pass/fail thresholds (e.g., faithfulness >= 0.85, pass rate >= 90%)
  4. Make a "bad" prompt change and verify the pipeline catches it
Exercise 5

Reflective Questions

  1. Why is LLM-as-judge evaluation both powerful and dangerous? What safeguards would you implement?
  2. If your RAGAS faithfulness score is 0.95 but context precision is 0.60, what does this tell you about your system? What would you fix first?
  3. How would you evaluate an agent that has multiple valid paths to the correct answer? Why is trajectory evaluation harder than output evaluation?
  4. Design a cost optimization strategy for an LLM app that processes 100,000 queries per day. What are the biggest levers?
  5. What are the limitations of automated metrics like BLEU and ROUGE for evaluating LLM outputs? When would you still use them?

LLMOps Configuration Document Generator

Configure your LLMOps evaluation and observability setup. Download as Word, Excel, PDF, or PowerPoint.


All data stays in your browser. Nothing is sent to or stored on any server.

Conclusion & Next Steps

You now have a comprehensive understanding of LLM evaluation and LLMOps — the discipline that separates prototype AI apps from production-grade systems. Here are the key takeaways from Part 15:

  • The Core Triad — Correctness, faithfulness, and relevance are the three fundamental evaluation dimensions for any LLM application, each measuring a distinct aspect of output quality
  • Evaluation methods — Human evaluation remains the gold standard but does not scale; LLM-as-judge provides the best balance of quality and cost; automated metrics are essential for CI/CD
  • RAGAS — The leading framework for RAG evaluation, providing independent metrics for both retrieval quality (context precision/recall) and generation quality (faithfulness/relevance)
  • Agent evaluation — Requires evaluating both outcomes and trajectories, tracking tool selection accuracy and efficiency alongside final answer quality
  • Observability — LangSmith, Langfuse, and Phoenix provide the tracing and monitoring infrastructure that makes debugging and optimization possible
  • CI/CD for LLMs — Every prompt change should be regression-tested against a golden dataset with clear pass/fail thresholds before deployment
  • Cost tracking — Token economics, per-endpoint cost monitoring, and optimization recommendations are essential for sustainable AI applications

Next in the Series

In Part 16: Production AI Systems, we tackle the infrastructure side — building FastAPI services for LLM APIs, implementing async streaming with SSE and WebSockets, queuing with Celery and Redis, semantic caching, scaling with vLLM and TGI, and monitoring production latency at P50/P95/P99.
