
Fine-tuning, RLHF & Model Alignment

March 30, 2026 · Wasil Zafar · 32 min read

LoRA, instruction tuning, RLHF, and DPO — modern techniques for adapting foundation models to specific tasks and aligning them with human preferences and values.

Table of Contents

  1. Why Fine-Tune?
  2. Parameter-Efficient Methods
  3. Instruction Tuning
  4. RLHF & Alignment
  5. Direct Preference Optimization
  6. Method Comparison Tables
  7. Code: LoRA Fine-tuning
  8. Code: DPO Training
  9. Code: Dataset Format
  10. Exercises
  11. Model Card Generator
  12. Conclusion & Next Steps

AI in the Wild: Real-World Applications & Ethics

Your 24-part learning path • Currently on Step 10

About This Article

This article covers the full adaptation and alignment stack for modern LLMs — from parameter-efficient fine-tuning methods like LoRA and QLoRA, through instruction tuning and dataset construction, to the alignment techniques (RLHF, DPO) that transform a raw pre-trained model into a safe and helpful assistant.


Why Fine-Tune?

Pre-trained language models encode vast world knowledge acquired from trillions of tokens of text, code, and structured data. Yet in their base form they are not immediately useful for production tasks: a base model responds by predicting statistically likely continuations with no notion of following instructions, maintaining safety constraints, or producing outputs in a prescribed format. Fine-tuning is the process that bridges the gap between raw capability and task-ready behaviour.

The motivation for fine-tuning operates on three distinct dimensions. First, task specialisation: moving the model's output distribution from the broad training corpus toward a narrow target distribution such as medical ICD coding, legal contract drafting, or customer support dialogue. Second, behavioural alignment: teaching the model to follow instructions, refuse harmful requests, cite sources, and behave consistently — none of which emerge reliably from pre-training alone. Third, inference efficiency: a well-tuned smaller model (7B–13B parameters) can match or exceed a much larger base model on a specific domain, dramatically reducing serving costs.

Key Insight: The choice of fine-tuning method should be driven by the distance between your target distribution and the pre-training distribution. If your task uses standard language and the model already knows the relevant facts, prompt engineering or a small LoRA adapter will usually suffice. Full fine-tuning is rarely worth the cost unless you have domain-specific data at scale and the distribution is genuinely far from the pre-training corpus.

Pre-Training vs. Fine-Tuning

Pre-training is the expensive, one-time process of learning general language representations from internet-scale data. Training GPT-4 is estimated to have consumed tens of thousands of A100-GPU-months and hundreds of millions of dollars in compute. The result is a model with broad world knowledge, reasoning ability, and language fluency — but no particular alignment with human intent.

Fine-tuning is comparatively cheap: you start from the pre-trained checkpoint and continue training on a much smaller, task-specific dataset. A LoRA fine-tune of a 7B model on 10,000 examples typically completes in under an hour on a single A100 GPU at a cost of a few dollars. The pre-trained weights serve as a powerful initialisation that encodes everything the model knows; fine-tuning steers it toward the desired behaviour without relearning from scratch.

A third pattern, continual pre-training, sits between the two: the model is trained on a large domain-specific corpus (e.g., PubMed abstracts, legal filings, or proprietary codebases) using the standard language modelling objective. This is appropriate when the domain vocabulary and facts are genuinely absent from the original pre-training data — as in highly specialised scientific or technical domains. It is significantly more expensive than fine-tuning but far cheaper than full pre-training.

When to Fine-Tune vs. Prompt

A practical decision framework for choosing your adaptation strategy:

Use Prompt Engineering When
  • The model already performs adequately on the task with careful prompting
  • You have fewer than ~100 high-quality labelled examples
  • Latency and cost constraints do not yet require a smaller model
  • The task changes frequently and maintaining a fine-tuned model is operationally expensive
  • You need rapid iteration without GPU infrastructure
Use Fine-Tuning (LoRA/QLoRA) When
  • You have 500–100,000 high-quality labelled examples
  • The output format must be highly consistent (e.g., structured JSON, medical codes)
  • You need to reduce serving cost by using a smaller model
  • Domain-specific terminology, style, or tone is critical
  • Latency requirements preclude long system prompts and few-shot examples
Use Full Fine-Tuning When
  • You have millions of domain-specific training examples
  • The domain is so specialised that even the embedding layer needs updating (e.g., protein sequences, assembly code)
  • You have the infrastructure to train and serve a fully adapted model
  • Maximum performance on a single, stable task justifies the cost

Parameter-Efficient Fine-Tuning Methods

Parameter-efficient fine-tuning (PEFT) is the practical middle ground between prompt engineering and full fine-tuning. The core idea: update 0.1–1% of parameters while keeping 99%+ frozen. The pre-trained model already encodes most of what you need; PEFT introduces a small number of new parameters that steer the frozen backbone toward target behaviour. Benefits include drastically reduced memory and compute, negligible risk of catastrophic forgetting, and the ability to maintain multiple adapters on a single shared base model.

LoRA & QLoRA

LoRA (Low-Rank Adaptation) is based on the observation that the weight updates produced by fine-tuning have low intrinsic rank — most of the useful adaptation can be captured in a much smaller subspace than the full weight matrix. Instead of updating the full weight matrix W ∈ ℝ^(d×k), LoRA represents the update as the product of two low-rank matrices: ΔW = BA, where B ∈ ℝ^(d×r) and A ∈ ℝ^(r×k), with r ≪ min(d, k). The original weights are frozen; only A and B are trained.

Practical implications: for a 7B model, a rank-16 LoRA targeting the attention layers reduces trainable parameters from 7 billion to roughly 7 million — a 1000x reduction. At inference time, the adapter can be merged into the base weights (W' = W + BA) producing zero additional latency. LoRA adapters are also tiny on disk (tens of megabytes vs. tens of gigabytes for the full model), enabling easy sharing and multi-tenant serving.
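The low-rank update and the merge step can be checked with toy matrices. Below is a dependency-free sketch (dimensions are illustrative, far smaller than a real model); note that B is initialised to zero, so ΔW = BA starts at zero and the adapter is a no-op until trained:

```python
import random

def matmul(X, Y):
    # Naive matrix multiply: (n x m) @ (m x p) -> (n x p)
    n, m, p = len(X), len(Y), len(Y[0])
    return [[sum(X[i][k] * Y[k][j] for k in range(m)) for j in range(p)] for i in range(n)]

d, k, r = 64, 64, 4  # layer dims and LoRA rank (illustrative)
random.seed(0)
W = [[random.gauss(0, 0.02) for _ in range(k)] for _ in range(d)]  # frozen base weight
A = [[random.gauss(0, 0.02) for _ in range(k)] for _ in range(r)]  # trainable, r x k (random init)
B = [[0.0] * r for _ in range(d)]                                  # trainable, d x r (zero init)

# Parameter accounting: r*(d+k) trainable values instead of d*k
full_params = d * k          # 4096
lora_params = r * (d + k)    # 512 -> 8x fewer even at these tiny dimensions

# Merge for inference: W' = W + BA, so the adapter adds zero serving latency
delta_W = matmul(B, A)
W_merged = [[W[i][j] + delta_W[i][j] for j in range(k)] for i in range(d)]
```

At real scale the same accounting yields the ~1000x figure quoted above: rank-16 adapters on the attention projections of a 7B model train only a few million values.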

Key hyperparameters: The rank r controls the capacity of the adapter (higher = more expressive but more parameters). The scaling factor α (typically set to 2r) controls the effective learning rate of the adapter. Target modules are typically the query and value projections of attention (q_proj, v_proj), though extending to all linear layers improves quality at modest cost. LoRA dropout (0.05–0.1) provides regularisation for small datasets.

QLoRA extends LoRA to 4-bit quantised base models. The frozen base model weights are stored in NF4 (4-bit NormalFloat) format with double quantisation, cutting weight memory by roughly 4x compared to FP16. The LoRA adapters themselves are trained in BF16, with paged optimisers that spill optimiser state to CPU RAM when it overflows GPU VRAM. QLoRA enables fine-tuning of 65B-parameter models on a single 48GB GPU — a task that would otherwise require 8+ A100s.
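The memory arithmetic behind these figures is easy to verify. A rough sketch counting weight storage only (it ignores activations, KV cache, adapter gradients, and quantisation constants, which is why real usage runs higher):

```python
def weight_gb(n_params, bits_per_weight):
    """Bytes needed to store the weights alone, expressed in GB."""
    return n_params * bits_per_weight / 8 / 1e9

print(f"7B  in FP16: {weight_gb(7e9, 16):.1f} GB")   # 14.0 GB
print(f"7B  in NF4 : {weight_gb(7e9, 4):.1f} GB")    # 3.5 GB  (4x smaller)
print(f"65B in NF4 : {weight_gb(65e9, 4):.1f} GB")   # 32.5 GB (fits a 48 GB card)
```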

Adapters & Prefix Tuning

Adapter layers insert small bottleneck MLP modules after each attention and feed-forward sub-layer. They project from the hidden dimension down to a smaller bottleneck (e.g., 64 units), apply a non-linearity, and project back up. Only adapter parameters are trained. The downside: unlike LoRA, adapters cannot be merged into base weights, so they add a small inference latency overhead.

Prefix tuning prepends a set of learnable "soft token" vectors to the key and value sequences in every attention layer. These vectors are never decoded; they act as a persistent context that steers attention patterns. Prefix tuning is highly parameter-efficient but is limited in expressiveness — it can shift style and format but struggles with tasks requiring new factual knowledge.

Prompt tuning operates only at the input embedding layer: a small set of learnable vectors is prepended to the input embeddings. At scale (11B+ models), prompt tuning approaches full fine-tuning quality; at 7B and below, it significantly underperforms LoRA. BitFit takes a different approach, training only the bias terms of the model — effectively free in terms of parameter count (biases are <0.1% of total parameters) but limited in expressiveness.

In practice, LoRA has become the default PEFT method due to its favourable accuracy-efficiency trade-off, zero inference overhead after merging, and broad library support (HuggingFace PEFT, Axolotl, LLaMA-Factory). The multi-task serving pattern — a shared frozen base with hot-swappable per-task adapters — is increasingly used in production to serve dozens of specialised models from a single GPU cluster.

Case Study

Fine-tuning Llama 3 on Medical QA with QLoRA

A health-tech team at a UK NHS Trust needed a clinical question-answering assistant for junior doctors. They adapted Llama 3 8B using QLoRA on 4,200 curated (question, answer) pairs from UpToDate and internal clinical guidelines. The full fine-tune on a single A100 80GB GPU took 4 hours. Evaluation on 200 held-out questions showed 78% accuracy vs 61% for the base model with a carefully engineered system prompt. In production, the model runs behind a retrieval layer (RAG) that fetches relevant guideline text before calling the fine-tuned model, combining factual grounding with domain-adapted reasoning style.


Instruction Tuning

Instruction tuning (also called supervised fine-tuning, or SFT) is the stage that transforms a raw pre-trained model — which simply predicts the next token — into an assistant that follows natural language instructions. The model is trained on a dataset of (instruction, response) pairs using the standard next-token prediction loss, but only the response tokens contribute to the loss gradient. This is a qualitative behavioural shift: the model learns that when it sees an instruction, the expected behaviour is to fulfil it helpfully and completely.

The data format matters enormously. Most modern models use a structured chat template with explicit role boundaries. The ChatML format (<|im_start|>system\n...<|im_end|>) or the Llama 3 template (<|begin_of_text|><|start_header_id|>system<|end_header_id|>...<|eot_id|>) encode the system prompt, user turn, and assistant response as distinct segments. Mixing formats across training examples is a common failure mode that produces inconsistent output formatting.
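As a concrete illustration of the ChatML convention, here is a hand-rolled formatter (for exposition only; in practice use the tokenizer's built-in chat template so training and inference formatting match exactly):

```python
def to_chatml(messages):
    """Render OpenAI-style message dicts into a ChatML training/inference string."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>" for m in messages]
    # Trailing assistant header cues the model to generate the next response
    return "\n".join(parts) + "\n<|im_start|>assistant\n"

msgs = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is LoRA?"},
]
print(to_chatml(msgs))
```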

Dataset Construction

The canonical open SFT datasets include FLAN (task-specific instruction collections across 60+ NLP benchmarks), Dolly 15K (15,000 human-written instruction-response pairs from Databricks employees), OpenAssistant Conversations (161,000 messages in 35 languages with ranked responses), and the ShareGPT corpus (millions of real ChatGPT conversations scraped from the web).

Synthetic data generation has become standard practice. The Self-Instruct method (Wang et al., 2022) uses a strong LLM to generate new instruction-response pairs from a seed set, bootstrapping the dataset with minimal human effort. Evol-Instruct (used in WizardLM and OpenHermes) applies an LLM to progressively rewrite instructions to be more complex, diverse, and specific. Orca-style data distils reasoning chains from GPT-4, training smaller models to produce step-by-step explanations rather than just final answers.

The critical finding from LIMA (Zhou et al., 2023) is that data quality dominates data quantity: 1,000 carefully curated, diverse, high-quality examples can match the SFT quality of tens of thousands of noisier examples. Deduplication, length calibration (mixing short and long responses), and instruction diversity (not just Q&A but also coding, creative writing, summarisation, classification) are more important than raw dataset size.

SFT Best Practices

Practical guidance for SFT that practitioners frequently discover the hard way:

  • Format consistency above all: A single format mismatch in 1% of your data can cause systematic output formatting failures. Validate every example with a parser before training.
  • Pack sequences efficiently: Concatenate multiple short examples into single training sequences (separated by EOS tokens) to maximise GPU utilisation. Without packing, a 2048-token context is typically 30–40% empty.
  • Train on responses only: Mask loss on prompt tokens. Training on both prompt and response teaches the model to complete prompts, which is different from following them.
  • Watch for domain imbalance: If 80% of your data is code Q&A, the model will regress on general instruction following. Maintain category diversity proportional to expected production traffic.
  • Evaluate on held-out instructions: SFT benchmark suites like MT-Bench and AlpacaEval provide LLM-judged evaluation of instruction-following quality across eight categories (writing, roleplay, reasoning, math, coding, extraction, STEM, humanities).
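Two of the points above — sequence packing and response-only loss masking — can be sketched on toy token IDs (illustrative integers standing in for real tokenizer output):

```python
EOS = 2  # end-of-sequence token id (illustrative)

def pack(examples, block_len):
    """Concatenate short examples, each terminated by EOS, into fixed-length blocks."""
    stream = [tok for ex in examples for tok in ex + [EOS]]
    return [stream[i:i + block_len] for i in range(0, len(stream) - block_len + 1, block_len)]

def response_mask(prompt_len, total_len):
    """Loss mask: 0 over prompt tokens, 1 over response tokens."""
    return [0] * prompt_len + [1] * (total_len - prompt_len)

print(pack([[5, 6, 7], [8, 9], [10, 11, 12, 13]], block_len=4))
# [[5, 6, 7, 2], [8, 9, 2, 10], [11, 12, 13, 2]]
print(response_mask(prompt_len=3, total_len=6))  # [0, 0, 0, 1, 1, 1]
```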
Production Warning: SFT alone is insufficient for safety. An instruction-tuned model that has not been aligned via RLHF or DPO will follow harmful instructions if phrased plausibly — as demonstrated by the "jailbroken" early releases of Alpaca and Vicuna. Never deploy a base SFT model in a consumer-facing context without an alignment stage. At minimum, apply DPO on a refusal-heavy preference dataset before shipping.

RLHF & Alignment

Reinforcement Learning from Human Feedback (RLHF) is the alignment technique behind InstructGPT, ChatGPT, Claude, Gemini, and every major production AI assistant. The fundamental problem it addresses: a model can follow instructions while simultaneously being unhelpful, deceptive, or harmful. Instruction tuning teaches the format; alignment teaches the values. RLHF introduces a training signal derived from human preference judgements rather than ground-truth labels.

The canonical RLHF pipeline operates in three stages. First, supervised fine-tuning (SFT) on a curated instruction dataset produces the SFT baseline — a well-behaved starting point. Second, reward model training uses human preference data to learn a scoring function. Third, PPO optimisation updates the SFT model's policy to maximise reward model scores while staying close to the SFT baseline via a KL divergence penalty.

Reward Modeling

The reward model (RM) is a neural network that takes a (prompt, response) pair and outputs a scalar score representing how much a human would prefer that response. It is typically initialised from the SFT model with the final token prediction head replaced by a linear scalar head.

Training data consists of pairwise preference comparisons: for a given prompt, human annotators choose which of two model responses they prefer. The Bradley-Terry model converts these pairwise comparisons into a ranking objective: the RM is trained to assign higher scores to preferred responses than rejected ones, minimising a cross-entropy loss over preference pairs.
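Concretely, the Bradley-Terry model says P(A preferred over B) = σ(r_A − r_B), and the RM minimises the negative log of that probability over preference pairs. A minimal sketch on scalar scores:

```python
import math

def rm_pairwise_loss(score_chosen, score_rejected):
    """Bradley-Terry cross-entropy: -log sigma(r_chosen - r_rejected)."""
    sigma = 1.0 / (1.0 + math.exp(-(score_chosen - score_rejected)))
    return -math.log(sigma)

print(rm_pairwise_loss(2.0, 0.5))  # ~0.20: RM already ranks this pair correctly
print(rm_pairwise_loss(0.5, 2.0))  # ~1.70: ranking inverted, large gradient
```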

Reward hacking is the primary failure mode: the policy learns to game the reward model rather than genuinely improving quality. Common manifestations include verbosity (longer responses score higher even when unhelpful), sycophancy (the RM was trained on human preferences that reward agreement), and formatting artifacts (bullet points and headers score higher regardless of content quality). Mitigation strategies include using an ensemble of reward models, adding a KL penalty against the SFT reference, and periodic reward model re-training on updated preference data.

Constitutional AI (CAI) and RLAIF (RL from AI Feedback) reduce dependence on expensive human annotation by using a strong LLM (Claude, GPT-4) to generate preference labels according to a set of principles (the "constitution"). This scales preference data collection dramatically but introduces the biases and failure modes of the judge model.

PPO & the RLHF Pipeline

Proximal Policy Optimisation (PPO) is the RL algorithm used to optimise the policy against the learned reward model. At each training step: (1) the policy generates a batch of responses to prompts from the preference data distribution; (2) the RM scores each response; (3) PPO computes the advantage estimate and updates the policy weights to increase the probability of high-reward responses. The clipped surrogate objective prevents excessively large updates that could destabilise training.

The KL divergence penalty is critical: without it, PPO rapidly drives the policy away from the SFT initialisation toward reward-hacking degenerate outputs. The penalty is applied as: R_total = R_RM - β * KL(π_θ || π_ref), where π_ref is the frozen SFT reference model. β is a hyperparameter controlling the trade-off between alignment and staying close to SFT behaviour.
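The penalised reward can be sketched per sequence, approximating the KL term by the summed log-probability gap between policy and reference on the generated tokens (a common single-sample estimator; exact estimators vary by implementation):

```python
def kl_penalized_reward(rm_score, logp_policy, logp_ref, beta=0.1):
    """R_total = R_RM - beta * KL(pi_theta || pi_ref), with the KL estimated
    from per-token log-probs of the sampled response."""
    kl_est = sum(lp - lr for lp, lr in zip(logp_policy, logp_ref))
    return rm_score - beta * kl_est

# Policy assigns its own samples higher probability than the reference does,
# so the KL estimate is positive and the reward is discounted:
print(kl_penalized_reward(1.5, [-0.2, -0.1, -0.3], [-0.5, -0.4, -0.6]))  # ~1.41
```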

The practical challenge of RLHF-PPO is the four-model memory requirement: the active policy (trainable), the frozen SFT reference (for KL computation), the reward model, and a critic model (value function for PPO advantage estimation). For 7B+ models, this requires a multi-GPU setup even with quantisation. This memory overhead is a primary driver for the adoption of simpler alignment alternatives like DPO.

Direct Preference Optimization

Direct Preference Optimization (DPO, Rafailov et al., 2023) achieves comparable alignment quality to RLHF while eliminating both the reward model and the PPO optimiser. The mathematical insight: the optimal RLHF policy can be expressed as a closed-form function of preference data, which means we can directly optimise preference probabilities without ever training a separate reward model or running online rollouts.

The DPO loss directly trains the policy to assign higher likelihood to preferred responses and lower likelihood to rejected responses, relative to a reference (SFT) model. The reference model prevents the policy from collapsing — without it, the model would simply assign near-zero probability to rejected responses rather than genuinely improving chosen response quality. Practical advantages: trains as fast as SFT (one forward-backward pass per preference pair), stable without careful hyperparameter tuning, no four-model memory overhead, and no reward hacking from a separate RM.
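Written out, the per-pair DPO loss is −log σ(β[(log π_θ(y_w) − log π_ref(y_w)) − (log π_θ(y_l) − log π_ref(y_l))]): the log-likelihood ratio against the reference acts as an implicit reward. A minimal sketch on per-sequence log-likelihoods:

```python
import math

def dpo_loss(logp_chosen, logp_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO: push the chosen/rejected likelihood gap beyond the reference model's gap."""
    margin = (logp_chosen - ref_chosen) - (logp_rejected - ref_rejected)
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Policy already prefers the chosen response slightly more than the reference does,
# so the margin is positive and the loss falls below log(2) (the chance level):
print(dpo_loss(-10.0, -14.0, ref_chosen=-11.0, ref_rejected=-13.0))  # margin = 2
```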

DPO vs. RLHF

The two approaches differ on several dimensions relevant to practitioners:

  • Training complexity: DPO is equivalent to SFT in implementation complexity. RLHF-PPO requires a custom training loop with four model copies, online generation, reward scoring, and GAE advantage estimation.
  • Memory: DPO needs two model copies (policy + reference). RLHF needs four (policy + reference + RM + critic).
  • Stability: DPO converges reliably without reward model hyperparameter sensitivity. PPO is notoriously sensitive to learning rate, KL coefficient, and batch size.
  • Data: DPO is offline — trained on a fixed preference dataset. PPO-based RLHF can collect on-policy preference data as the model improves, which is theoretically superior for continuous alignment.
  • Alignment quality: DPO and RLHF-PPO achieve similar MT-Bench scores at 7B scale. At 70B+ scale with large preference datasets and on-policy data collection, RLHF maintains a quality edge.

Practical Implementation

The DPO family has expanded significantly since the original paper. IPO (Identity Preference Optimisation) adds a regularisation term that makes training robust to small or noisy preference datasets. SimPO removes the reference model entirely, using length-normalised response likelihood as an implicit reward. KTO (Kahneman-Tversky Optimisation) trains on unpaired binary feedback (good/bad labels) rather than preference pairs, enabling simpler annotation schemes. ORPO merges the SFT and alignment stages into a single training pass, reducing total training time.

Practical guidance: start with DPO on 5,000–50,000 high-quality preference pairs; only move to PPO-based RLHF if DPO demonstrably underperforms on your evaluation suite and you have the infrastructure to support four-model training. Preference data quality matters more than quantity: a well-calibrated annotator team producing 10,000 pairs will outperform noisy LLM-generated preference labels at 100,000 pairs.

Key Insight: DPO achieves InstructGPT-level alignment quality without a separate reward model, without PPO, and without the four-model memory overhead. For most teams, DPO is the right default alignment method. The canonical recipe: (1) SFT on 5,000–50,000 high-quality instruction-response pairs; (2) DPO on 5,000–50,000 preference pairs. Total training time for a 7B model: under 12 hours on 4x A100s.

Fine-Tuning & Alignment Method Comparisons

Choosing the right fine-tuning and alignment method requires weighing multiple factors simultaneously. The following tables provide a structured comparison across the most important dimensions.

Fine-Tuning Methods Comparison

| Method | Trainable Params | GPU Memory (7B) | Training Cost | Quality | Best For |
|---|---|---|---|---|---|
| Full Fine-Tuning | 100% | ~140 GB (FP16) | $$$$ (8+ A100s) | Highest | Large domain shifts, millions of examples |
| LoRA | 0.1–1% | ~18 GB (BF16) | $ (single GPU) | High | Task adaptation, format enforcement, 500–100K examples |
| QLoRA | 0.1–1% | ~6 GB (4-bit) | $ (consumer GPU) | High (1–3% below LoRA) | Single-GPU fine-tuning, resource-constrained environments |
| Prefix Tuning | <0.1% | ~18 GB (BF16) | $ (single GPU) | Medium | Style/format adaptation, moderate expressiveness |
| Instruction Tuning (SFT) | 100% or PEFT | Varies | $–$$ | High (behavioural) | Converting base model to instruction follower |
| RLHF (PPO) | Policy only | 4× model memory | $$$ | Highest (alignment) | Safety-critical deployment, continuous alignment |
| DPO | Policy only | 2× model memory | $$ | High (alignment) | Offline alignment, most production use cases |

Model Alignment Techniques Comparison

| Technique | How it Works | Human Involvement | Cost | Alignment Quality | Used By |
|---|---|---|---|---|---|
| SFT | Train on (instruction, response) pairs with next-token loss | High (labelling responses) | $ | Good (format/behaviour) | All major labs (stage 1) |
| RLHF | Train RM on preferences; PPO optimises policy vs. RM | High (preference pairs) | $$$ | Very High | OpenAI (InstructGPT, ChatGPT), Google |
| DPO | Directly optimise preference probabilities, no RM | Moderate (preference pairs) | $$ | High (comparable to RLHF) | Meta (Llama 3), Mistral, many open models |
| RLAIF | LLM generates preference labels (replaces human labellers) | Low (AI labelling) | $ | High (with strong judge) | Google (Gemini), Anthropic (experiments) |
| Constitutional AI | LLM critiques and revises responses per a set of principles | Low (write constitution) | $$ | Very High (safety) | Anthropic (Claude) |

Code: LoRA Fine-Tuning with PEFT

The following demonstrates a complete LoRA fine-tuning run using HuggingFace PEFT and TRL's SFTTrainer. This setup fine-tunes Mistral-7B on a medical QA dataset using only 0.18% of the model's parameters on a single GPU.

from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training, TaskType
from trl import SFTTrainer
import torch

# Load base model (4-bit quantized to fit on single GPU)
model_id = "mistralai/Mistral-7B-v0.1"
model = AutoModelForCausalLM.from_pretrained(
    model_id, load_in_4bit=True, device_map="auto",
    torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Mistral ships no pad token; reuse EOS
model = prepare_model_for_kbit_training(model)  # upcasts norms, preps 4-bit weights for training

# LoRA configuration: inject rank-16 adapters into attention layers only
lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=16,                    # rank of update matrices (trade-off: quality vs params)
    lora_alpha=32,           # scaling: adapter output is multiplied by lora_alpha / r
    target_modules=["q_proj", "v_proj"],  # which attention matrices to adapt
    lora_dropout=0.05,
    bias="none"
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()
# → trainable params: 6,815,744 || all params: 3,758,071,808 || trainable%: 0.18%

# SFTTrainer: fine-tune on instruction-formatted dataset
trainer = SFTTrainer(
    model=model, tokenizer=tokenizer,
    train_dataset=train_dataset,  # {"text": "<s>[INST] {instruction} [/INST] {response}</s>"}
    max_seq_length=2048,
    args=TrainingArguments(
        output_dir="./lora-medical-qa",
        num_train_epochs=3,
        per_device_train_batch_size=4,
        gradient_accumulation_steps=4,  # effective batch = 16
        learning_rate=2e-4,
        lr_scheduler_type="cosine",
        warmup_ratio=0.05,
        save_steps=100, logging_steps=25
    )
)
trainer.train()
# Training cost: ~$5 on A100 for 1K examples vs $500+ full fine-tune

Code: DPO Training

DPO requires preference pairs — (prompt, chosen_response, rejected_response) triplets. Here we demonstrate the full DPO training loop using TRL's DPOTrainer, which handles the reference model computation and the DPO loss automatically.

from trl import DPOTrainer, DPOConfig
from datasets import Dataset

# DPO: directly optimize preferences without separate reward model
# Dataset format: (prompt, chosen_response, rejected_response) triplets
preference_data = Dataset.from_dict({
    "prompt": [
        "Explain photosynthesis to a 10-year-old.",
        "Write a professional email declining a meeting."
    ],
    "chosen": [
        "Plants make their own food using sunlight! Here's how it works: leaves capture sunlight, roots drink water, and together they make sugar the plant uses for energy. They also release oxygen — that's the air we breathe!",
        "Thank you for the invitation. Unfortunately, I have a prior commitment at that time and won't be able to attend. I'd welcome the opportunity to connect at a future date."
    ],
    "rejected": [
        "Photosynthesis is the process by which organisms containing the pigment chlorophyll convert light energy into chemical energy stored in glucose.",
        "I can't make it sorry. Maybe another time."
    ]
})

dpo_config = DPOConfig(
    beta=0.1,  # KL divergence penalty: higher = stay closer to reference model
    max_prompt_length=512,
    max_length=1024,
    num_train_epochs=1,
    learning_rate=5e-7,  # very small LR for preference alignment
    per_device_train_batch_size=2
)

trainer = DPOTrainer(
    model=model,
    ref_model=None,  # None = use model itself as reference (PEFT-compatible)
    args=dpo_config,
    train_dataset=preference_data,
    tokenizer=tokenizer
)
trainer.train()
# DPO advantage over RLHF: no separate reward model, more stable training, simpler

Code: Instruction Tuning Dataset Format

High-quality instruction tuning datasets use the multi-turn chat format with explicit role markers. The following shows the OpenAI-compatible messages format used by most modern SFT pipelines, with a domain-specific medical coding example.

[
  {
    "messages": [
      {"role": "system", "content": "You are a medical coding specialist following ICD-11 standards."},
      {"role": "user",   "content": "Code: Patient presents with moderate persistent asthma without complications."},
      {"role": "assistant", "content": "CA23.1 — Asthma, moderate persistent, without complications\nRationale: CA23 = Asthma; .1 = moderate persistent severity; no mention of complications or status asthmaticus."}
    ]
  },
  {
    "messages": [
      {"role": "system", "content": "You are a medical coding specialist following ICD-11 standards."},
      {"role": "user",   "content": "Code: Type 2 diabetes with stage 3 chronic kidney disease."},
      {"role": "assistant", "content": "5A11 + MF56.2 — Type 2 diabetes mellitus (5A11) with Chronic kidney disease, stage 3 (MF56.2)\nRationale: ICD-11 uses extension codes for comorbidities. Diabetes and CKD are coded separately and linked via extension."}
    ]
  }
]
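In the spirit of the "validate every example with a parser" advice from the SFT best practices, here is a hypothetical validator for this messages format (field names match the example above; adjust the rules to your own template):

```python
VALID_ROLES = {"system", "user", "assistant"}

def validate_example(example):
    """Return a list of problems for one {'messages': [...]} record (empty list = OK)."""
    problems = []
    msgs = example.get("messages")
    if not isinstance(msgs, list) or not msgs:
        return ["missing or empty 'messages' list"]
    for i, m in enumerate(msgs):
        if m.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: bad role {m.get('role')!r}")
        content = m.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or non-string content")
    if msgs[-1].get("role") != "assistant":
        problems.append("last message must be from the assistant (it carries the loss)")
    return problems

good = {"messages": [{"role": "user", "content": "Hi"},
                     {"role": "assistant", "content": "Hello!"}]}
bad = {"messages": [{"role": "user", "content": "Hi"}]}
print(validate_example(good))  # []
print(validate_example(bad))   # flags the missing assistant turn
```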

Advanced Alignment Topics

Beyond the core SFT-RLHF-DPO pipeline, several advanced alignment research directions are shaping the future of safe AI development. Practitioners building production systems benefit from understanding these techniques even if they are not yet in their immediate stack.

Scalable Oversight

Scalable oversight addresses the fundamental challenge of AI alignment: how do you align a model that is more capable than the humans supervising it? Standard RLHF assumes humans can correctly identify the better of two model responses — but for highly technical domains (advanced mathematics, cutting-edge research, long-horizon planning), this assumption breaks down. The model may produce a plausible-sounding but incorrect response that human evaluators cannot detect.

Proposed solutions include debate (Irving et al., 2018): two AI agents debate the correct answer while a human judge arbitrates, with the insight that it is easier to identify a flaw in an argument than to verify a complex claim from scratch. Recursive reward modelling trains assistants to help humans give more informed feedback, improving the quality of the reward signal. Iterated amplification bootstraps oversight by decomposing complex tasks into simpler sub-tasks the human can evaluate directly.

Process Reward Models

Process Reward Models (PRMs) score the quality of each reasoning step in a chain-of-thought, rather than only the final answer. Standard reward models are outcome-based — they score final responses — which can incentivise the model to arrive at correct answers through flawed reasoning. PRMs provide denser training signal and better generalisation for multi-step reasoning tasks. OpenAI's Let's Verify Step by Step paper demonstrated that process supervision significantly outperforms outcome supervision on mathematical problem solving, and process-level reward signals now feature prominently in frontier reasoning models such as OpenAI's o-series — though not universally: DeepSeek-R1 reported strong results using outcome-based rewards alone.

Constitutional AI in Depth

Anthropic's Constitutional AI (CAI) uses a two-phase approach. In the supervised learning phase (SL-CAI), the model generates responses to harmful prompts, critiques and revises them according to a set of principles (the "constitution"), and is then fine-tuned on the revised responses. In the RL from AI feedback (RLAIF) phase, the AI uses the constitution to provide preference labels for pairs of responses, replacing expensive human annotation. The constitution can articulate nuanced ethical principles that are difficult to convey through preference data alone.
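The critique-and-revise loop can be sketched in a few lines. The `llm` function below is a stand-in for a real model call, and the prompt templates are illustrative assumptions, not Anthropic's actual wording:

```python
# Sketch of the SL-CAI critique -> revision loop. `llm` is any callable
# that maps a prompt string to a completion string.
CONSTITUTION = [
    "Choose the response that is least likely to assist harmful activity.",
    "Choose the response that is most honest about its own uncertainty.",
]

def critique_and_revise(llm, user_prompt):
    """Generate a response, then critique and revise it once per
    constitutional principle. The final (prompt, response) pair becomes
    supervised fine-tuning data."""
    response = llm(f"User: {user_prompt}\nAssistant:")
    for principle in CONSTITUTION:
        critique = llm(f"Critique this response against the principle "
                       f"'{principle}':\n{response}")
        response = llm(f"Revise the response to address this critique:\n"
                       f"Critique: {critique}\nResponse: {response}")
    return response
```

The key property is that no human writes the revised responses: the model's own critiques, steered by the written constitution, generate the training data.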

Evaluation and Red-Teaming

Alignment evaluation requires systematic red-teaming — actively attempting to elicit harmful, deceptive, or unsafe behaviour from aligned models. Red-teaming approaches include: manual red-teaming by domain experts (the most effective but expensive); automated red-teaming using adversarial LLMs to generate jailbreaking prompts at scale; constitutional red-teaming where the model critiques its own responses; and benchmark suites like TruthfulQA (accuracy on misleading questions), HarmBench (harmfulness elicitation), and SafetyBench (multilingual safety).

Production Practice: Alignment is a continuous process, not a deployment milestone. Production models require: (1) systematic red-teaming before each release; (2) ongoing monitoring for policy violations and edge cases in production traffic; (3) a feedback loop for annotating and adding problematic examples to alignment training data; (4) a clear process for emergency model updates when a new failure mode is discovered. Budget for alignment engineering as a permanent team function, not a one-time project.

Multi-Task LoRA Serving Architecture

LoRA adapters can be combined using several techniques. Model merging (SLERP, TIES-Merging, DARE) linearly interpolates or combines the weights of multiple fine-tuned models, producing a merged model with averaged capabilities — useful for combining domain knowledge (e.g., a medical adapter + a coding adapter) without multi-task training. LoRA composition stacks multiple LoRA adapters during inference, applying each in sequence. Task vectors represent fine-tuning as a direction in weight space that can be added, subtracted, or scaled to compose capabilities.
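Task-vector arithmetic is simple enough to show directly. The sketch below uses flat Python lists in place of real weight tensors, and the "medical" and "coding" fine-tunes are toy stand-ins:

```python
def task_vector(base, ft):
    """tau = W_finetuned - W_base, computed per named parameter."""
    return {k: [fv - bv for fv, bv in zip(ft[k], base[k])] for k in base}

def apply_task_vectors(base, vectors, scales):
    """W_merged = W_base + sum_i(scale_i * tau_i)."""
    out = {k: list(v) for k, v in base.items()}
    for tau, s in zip(vectors, scales):
        for k in out:
            out[k] = [w + s * t for w, t in zip(out[k], tau[k])]
    return out

base = {"w": [1.0, 2.0]}
med  = {"w": [1.5, 2.0]}   # toy "medical" fine-tune
code = {"w": [1.0, 3.0]}   # toy "coding" fine-tune
merged = apply_task_vectors(
    base, [task_vector(base, med), task_vector(base, code)], [1.0, 1.0])
print(merged["w"])  # [1.5, 3.0] -- both capabilities composed
```

Negative scales implement capability subtraction (e.g. "un-learning" a behaviour), and scales other than 1.0 trade off how strongly each fine-tune is expressed; methods like TIES-Merging and DARE add conflict resolution and sparsification on top of this basic arithmetic.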

The multi-task LoRA serving pattern is particularly powerful for organisations with many specialised use cases: a single large base model (e.g., Llama 3 70B) is kept in GPU VRAM, with a library of task-specific LoRA adapters that are hot-swapped per request. Each adapter adds only a few tens of megabytes of overhead. A request routing layer selects the appropriate adapter based on task type, user segment, or explicit selection. This architecture reduces total GPU footprint dramatically compared to serving multiple full fine-tuned models.
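The routing layer itself is a small piece of engineering. A minimal sketch with a plain LRU cache, where `load_fn` stands in for the actual adapter load (a real implementation would call into the serving stack, e.g. PEFT's adapter loading or a vLLM LoRA endpoint):

```python
from collections import OrderedDict

class AdapterRouter:
    """Per-request LoRA adapter selection with an LRU cache of loaded
    adapters. `load_fn` is a hypothetical stand-in for loading adapter
    weights into VRAM."""
    def __init__(self, load_fn, capacity=8):
        self.load_fn = load_fn
        self.capacity = capacity
        self.cache = OrderedDict()  # adapter name -> loaded weights

    def get(self, task):
        if task in self.cache:
            self.cache.move_to_end(task)        # mark most recently used
        else:
            if len(self.cache) >= self.capacity:
                self.cache.popitem(last=False)  # evict least recently used
            self.cache[task] = self.load_fn(task)
        return self.cache[task]

router = AdapterRouter(load_fn=lambda name: f"weights:{name}", capacity=2)
router.get("gdpr"); router.get("sox"); router.get("gdpr")
router.get("pci_dss")       # evicts "sox", the least recently used
print(list(router.cache))   # ['gdpr', 'pci_dss']
```

A production router would additionally weight eviction by request frequency and pre-warm high-traffic adapters, but the cache-and-swap core is this simple.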

Evaluating Fine-tuned Models in Production

Production fine-tuning evaluation requires more than offline benchmarks. A robust evaluation suite for a domain-adapted LLM typically includes: held-out task accuracy on a representative sample of the target task; regression testing on general instruction following benchmarks (MT-Bench, MMLU) to detect capability degradation; latency and throughput profiling at production batch sizes; human preference evaluation comparing the fine-tuned model to the base model and to existing solutions; and adversarial input testing for the specific safety concerns of the domain.

A common production failure mode: fine-tuning improves task-specific accuracy while degrading general instruction following. This manifests as a model that performs well on the narrow target task but fails on adjacent requests from production users. The solution is diverse evaluation — include both task-specific and general evaluations, and treat regression on general benchmarks as a blocker for deployment.
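This policy can be encoded as an explicit deployment gate. A minimal sketch, where the threshold values are illustrative assumptions rather than industry standards:

```python
def deployment_gate(base_scores, ft_scores,
                    min_task_gain=0.02, max_regression=0.01):
    """Pass only if the fine-tuned model gains on the target task AND
    does not meaningfully regress on any general benchmark. Returns
    (passed, dict of benchmark -> regression size)."""
    task_gain = ft_scores["task"] - base_scores["task"]
    regressions = {
        k: round(base_scores[k] - ft_scores[k], 4)
        for k in base_scores
        if k != "task" and base_scores[k] - ft_scores[k] > max_regression
    }
    passed = task_gain >= min_task_gain and not regressions
    return passed, regressions

base = {"task": 0.70, "mmlu": 0.65, "hellaswag": 0.80}
ft   = {"task": 0.85, "mmlu": 0.60, "hellaswag": 0.80}
print(deployment_gate(base, ft))  # blocked: MMLU regressed despite task gain
```

Treating the general-benchmark regression as a hard blocker, not a warning, is the point: a +0.15 task gain does not compensate for a model that breaks on adjacent production requests.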

Production Pattern

LoRA Adapter Hot-Swap Serving

A fintech company serves 18 different regulatory compliance tasks (GDPR, PCI-DSS, SOX, local banking regulations per jurisdiction) using a single Llama 3 70B base model hosted on 4× A100 80GB. Each compliance domain has its own LoRA adapter (rank 16, ~80MB each). The router examines the request's jurisdiction and compliance type and loads the appropriate adapter. Average adapter swap time: 40ms (VRAM-resident). Total GPU memory for 18 adapters + base model: ~168GB vs ~1,400GB if each were a separate full model. The base model stays loaded permanently; adapters are cached in a weighted LRU cache based on request frequency.


Exercises

These exercises are designed to move you from conceptual understanding to hands-on implementation. Work through them in order — each builds on the previous.

Beginner

Exercise 1: Exploring LoRA Rank Trade-offs

Using HuggingFace PEFT, load Mistral-7B-v0.1 and create four LoRA configurations with ranks 4, 8, 16, and 32, targeting q_proj and v_proj. For each configuration, call model.print_trainable_parameters() and record the number of trainable parameters. Plot trainable parameter count vs. LoRA rank. What does this relationship tell you about the quality vs. efficiency trade-off? At what rank would you expect diminishing returns on a typical instruction tuning task?
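Before reaching for a GPU, you can predict the parameter counts analytically. For each adapted weight matrix of shape (d_in, d_out), LoRA adds A (d_in × r) and B (r × d_out), i.e. r·(d_in + d_out) trainable parameters. A rough calculator, assuming Mistral-7B-v0.1's shapes (32 layers, hidden size 4096, 8 KV heads × head dim 128, so v_proj projects 4096 → 1024):

```python
def lora_trainable_params(rank, layers=32,
                          shapes=((4096, 4096), (4096, 1024))):
    """Trainable LoRA parameters when targeting one matrix per entry in
    `shapes` (here q_proj and v_proj) in every layer: each (d_in, d_out)
    matrix contributes rank * (d_in + d_out) parameters."""
    per_layer = sum(rank * (d_in + d_out) for d_in, d_out in shapes)
    return layers * per_layer

for r in (4, 8, 16, 32):
    print(r, lora_trainable_params(r))  # grows linearly in rank
```

The linearity is the answer to the exercise's first question: doubling the rank exactly doubles the trainable parameters, while quality gains typically flatten well before r=32 on narrow instruction-tuning tasks.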

Intermediate

Exercise 2: Building a Reward Model from Preference Pairs

Collect 50 preference pairs (chosen vs. rejected responses) on a domain task such as email writing, code explanation, or customer support. Split into 40 training pairs and 10 test pairs. Fine-tune a small model (e.g., GPT-2 or a 1.3B model) with a scalar output head using the Bradley-Terry cross-entropy loss. Evaluate your reward model on the held-out 10 pairs: what accuracy do you achieve (i.e., what fraction of the time does the RM correctly score the chosen response higher)? What are the most common failure modes?
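The Bradley-Terry objective mentioned above fits in a few lines. A minimal scalar version for reference while you implement the exercise (a real reward model would produce `r_chosen` and `r_rejected` from a shared network over the full responses):

```python
import math

def bradley_terry_loss(r_chosen, r_rejected):
    """-log(sigmoid(r_chosen - r_rejected)): minimised when the reward
    model assigns the chosen response a higher score than the rejected
    one, with the gap pushed as wide as possible."""
    margin = r_chosen - r_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

print(bradley_terry_loss(2.0, 0.5))  # small loss: RM agrees with the label
print(bradley_terry_loss(0.5, 2.0))  # large loss: RM disagrees
```

For the accuracy evaluation the exercise asks for, you only need the sign of the margin: the RM is "correct" on a held-out pair whenever `r_chosen > r_rejected`.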

Advanced

Exercise 3: End-to-End LoRA Fine-Tuning with Evaluation

Fine-tune a 7B model (Mistral-7B or Llama 3 8B) with LoRA rank 16 on a custom instruction dataset of 100+ examples in a domain of your choice (legal, medical, technical documentation, etc.). Create 20 held-out test prompts that were not in your training set. Compare the fine-tuned model's responses to the base model's responses on all 20 prompts. Qualitatively assess improvement across four dimensions: relevance, format adherence, domain accuracy, and response length calibration. Report what your fine-tuning achieved and where it fell short.


Fine-tuned Model Card Generator

Model cards are the standard documentation artifact for any fine-tuned model — capturing intended use, training data, performance characteristics, and known limitations. Generate a complete model card for your fine-tuned model using the form below.


Conclusion & Next Steps

The four-layer adaptation stack covers the full journey from a raw pre-trained model to a production-ready AI assistant: full fine-tuning for large distribution shifts with abundant data, LoRA/QLoRA for efficient task adaptation with modest data budgets, instruction tuning (SFT) for teaching behavioural format and instruction following, and RLHF/DPO for aligning model values with human preferences and safety requirements.

A key insight running through this article: alignment is not a one-time checkbox. Production models require continuous evaluation, periodic red-teaming, and scheduled re-alignment as the world changes and the model's failure modes become better understood. The most sophisticated alignment pipeline in the world can be undone by data drift, adversarial users, or unanticipated edge cases in deployment.

The field is evolving rapidly. Constitutional AI and RLAIF are reducing the cost of alignment by replacing expensive human preference annotation with LLM-generated feedback. The DPO family (IPO, SimPO, KTO, ORPO) is simplifying the alignment pipeline further. Scalable oversight techniques (debate, process reward models, automated red-teaming) are being developed to align models that are more capable than the humans supervising them — the core challenge for the next decade of AI development.

Next in the Series

In Part 11: Generative AI Applications, we move from alignment to creation — covering diffusion models, GANs, VAEs, and the real-world applications transforming creative industries. We'll explore Stable Diffusion, DALL-E 3, text-to-audio, and video generation with production-ready code examples.
