13 Training objectives
After completing this chapter, you will be able to:
- Define the causal language modeling objective and compute perplexity
- Explain masked language modeling and why BERT uses it
- Describe instruction tuning and its role in making models follow directions
- Understand RLHF: reward models, policy optimization, and KL constraints
- Compare different training objectives and their effects on model behavior
The transformer architecture defines how information flows through the network. But to learn useful weights, we need a training objective: a mathematical function that measures how well the model performs and provides gradients for improvement. This chapter covers the primary training objectives used for transformer models.
13.1 Language modeling
The most fundamental objective for decoder-style transformers is language modeling: predicting the next token given previous tokens.
13.1.1 The objective
Given a sequence of tokens \(x_1, x_2, \ldots, x_T\), the model learns to predict each token from its predecessors:
\[ \mathcal{L}_{LM} = -\sum_{t=1}^{T} \log P(x_t | x_1, \ldots, x_{t-1}) \]
At each position \(t\), the model outputs a probability distribution over the vocabulary. We measure how much probability mass lands on the correct token \(x_t\) using the negative log-likelihood. Lower loss means the model assigns higher probability to the true next tokens.
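To make the objective concrete, here is a minimal sketch in PyTorch (the framework choice is an assumption; the chapter does not prescribe one). It shifts the logits by one position so that position \(t\) is scored on token \(t+1\), then applies cross-entropy over the vocabulary.

```python
import torch
import torch.nn.functional as F

def causal_lm_loss(logits: torch.Tensor, tokens: torch.Tensor) -> torch.Tensor:
    """Next-token prediction loss.

    logits: (batch, seq_len, vocab) model outputs at every position
    tokens: (batch, seq_len) input token ids
    Position t's logits predict token t+1, so predictions and targets are shifted by one.
    """
    pred = logits[:, :-1, :]              # predictions for positions 1..T-1
    target = tokens[:, 1:]                # the tokens they should predict
    return F.cross_entropy(
        pred.reshape(-1, pred.size(-1)),  # (batch*(T-1), vocab)
        target.reshape(-1),               # (batch*(T-1),)
    )

# Example with random data
logits = torch.randn(2, 8, 100)           # batch=2, seq_len=8, vocab=100
tokens = torch.randint(0, 100, (2, 8))
loss = causal_lm_loss(logits, tokens)
```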
13.1.2 Why this works
Language modeling is self-supervised: the training signal comes from the data itself, requiring no manual labels. Given the sentence “The cat sat on the”, the model learns that “mat” is a likely continuation while “elephant” is not. Through millions of such predictions, the model learns syntax, semantics, facts, and reasoning patterns.
The causal masking in decoder attention ensures the model cannot cheat by looking at future tokens. Position \(t\) can only attend to positions \(1, \ldots, t-1\), so the prediction \(P(x_t | x_1, \ldots, x_{t-1})\) depends only on legitimate context.
13.1.3 Perplexity
We often report perplexity instead of raw loss:
\[ \text{PPL} = \exp\left(\frac{1}{T} \sum_{t=1}^{T} -\log P(x_t | x_1, \ldots, x_{t-1})\right) = \exp(\mathcal{L}_{LM} / T) \]
Perplexity has an intuitive interpretation: it measures how “surprised” the model is by the data. A perplexity of 10 means the model is, on average, as uncertain as if it were choosing uniformly among 10 options at each position. Lower perplexity indicates better predictions.
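A small sketch of the conversion from per-token negative log-likelihoods to perplexity, using made-up loss values:

```python
import torch

# Perplexity is the exponential of the mean per-token negative log-likelihood.
# `token_nlls` stands in for -log P(x_t | x_<t) at each position (made-up values).
token_nlls = torch.tensor([2.1, 1.4, 0.3, 3.0, 0.9])
perplexity = torch.exp(token_nlls.mean())
print(perplexity.item())  # ~4.7: on average the model is about as uncertain
                          # as a uniform choice over ~4.7 tokens
```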
13.2 Masked language modeling
For encoder-style transformers like BERT (Devlin et al. 2018), we use masked language modeling (MLM): predicting randomly masked tokens from bidirectional context.
13.2.1 The objective
Given a sequence, we randomly select 15% of tokens for prediction. For selected positions, we replace the token with [MASK] 80% of the time, a random token 10% of the time, or keep it unchanged 10% of the time. The model predicts the original tokens:
\[ \mathcal{L}_{MLM} = -\sum_{t \in \mathcal{M}} \log P(x_t | \mathbf{x}_{\backslash t}) \]
where \(\mathcal{M}\) is the set of masked positions and \(\mathbf{x}_{\backslash t}\) denotes all tokens except position \(t\) (which the model sees as [MASK] or corrupted).
13.2.2 Bidirectional context
Unlike causal language modeling, MLM uses bidirectional attention. When predicting the masked token in “The [MASK] sat on the mat”, the model can use both “The” (left context) and “sat on the mat” (right context). This bidirectional understanding is valuable for tasks like classification and question answering where we have complete input sequences.
13.2.3 The masking strategy
The 80/10/10 split serves specific purposes. Using [MASK] 80% of the time teaches the model to predict from context. Using random tokens 10% of the time prevents the model from assuming [MASK] always means “predict here”. Keeping original tokens 10% of the time teaches the model that unmasked positions might still need prediction, which helps during fine-tuning when there are no [MASK] tokens.
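The masking procedure is straightforward to express in code. The sketch below assumes PyTorch tensors of token ids and uses -100 as the ignored-label convention; the `mask_id` and vocabulary size are placeholders.

```python
import torch

def mlm_mask(tokens: torch.Tensor, mask_id: int, vocab_size: int,
             select_prob: float = 0.15):
    """BERT-style masking: select 15% of positions; of those,
    80% -> [MASK], 10% -> random token, 10% -> unchanged.
    Returns (corrupted_tokens, labels), where labels is -100 at positions
    that are not predicted (the convention cross_entropy ignores)."""
    labels = tokens.clone()
    selected = torch.rand(tokens.shape) < select_prob
    labels[~selected] = -100                              # only selected positions contribute to loss

    corrupted = tokens.clone()
    roll = torch.rand(tokens.shape)
    use_mask = selected & (roll < 0.8)                    # 80%: replace with [MASK]
    use_rand = selected & (roll >= 0.8) & (roll < 0.9)    # 10%: random token
    # remaining 10% of selected positions: keep the original token
    corrupted[use_mask] = mask_id
    corrupted[use_rand] = torch.randint(0, vocab_size, (int(use_rand.sum()),))
    return corrupted, labels

tokens = torch.randint(0, 30000, (2, 16))
corrupted, labels = mlm_mask(tokens, mask_id=103, vocab_size=30000)
```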
13.3 Next sentence prediction
BERT introduced a secondary objective: predicting whether two sentences are consecutive in the original text.
13.3.1 The objective
Given sentence pair (A, B), predict whether B actually followed A in the corpus:
\[ \mathcal{L}_{NSP} = -\log P(\text{IsNext} | \text{[CLS]}, A, \text{[SEP]}, B) \]
The model uses the [CLS] token representation to make a binary classification. 50% of training pairs are true consecutive sentences, 50% are random pairs.
13.3.2 Limitations
Later research showed NSP provides limited benefit and can even hurt performance. The task is too easy: distinguishing random sentences often reduces to topic detection rather than understanding coherence. Models like RoBERTa (Liu et al. 2019) dropped NSP entirely and achieved better results with just MLM.
13.4 Causal language modeling at scale
GPT-style models (Radford et al. 2018) use pure causal language modeling but at massive scale. The key insight from GPT-2 (Radford et al. 2019) and GPT-3 (Brown et al. 2020) is that a sufficiently large language model trained on diverse text becomes a general-purpose system.
13.4.1 Emergent capabilities
As models scale, they develop capabilities not explicitly trained:
- Zero-shot learning: Performing tasks from instructions alone
- Few-shot learning: Learning new tasks from a handful of examples in context
- Chain-of-thought reasoning: Breaking complex problems into steps
These emerge from the language modeling objective because predicting text requires understanding the underlying concepts, relationships, and reasoning patterns.
13.4.2 The training data
Modern language models train on web-scale corpora: hundreds of billions of tokens from books, websites, code repositories, and other sources. Data quality and diversity matter enormously. Filtering for quality, deduplicating, and balancing domains all improve downstream performance.
13.5 Instruction tuning
A language model trained purely on next-token prediction learns to complete text, but completing text is not the same as following instructions. Given the prompt “What is the capital of France?”, a pretrained model might continue with “is a common geography question” or “The capital of Germany is Berlin” because these are plausible text continuations. The model has no notion that it should answer the question.
Instruction tuning bridges this gap. We fine-tune the pretrained model on examples of instructions paired with appropriate responses, teaching it to interpret prompts as requests and generate helpful outputs.
13.5.1 The data format
Each training example consists of an instruction (or prompt) and a response:
Instruction: Summarize the following article in three sentences. [Article text…]
Response: The article discusses… Key findings include… The authors conclude…
Instructions vary widely: “Translate this to Spanish”, “Write a Python function that sorts a list”, “Explain quantum entanglement to a 10-year-old”, “What are the pros and cons of solar energy?”. The diversity matters. A model trained only on translation instructions would not learn to answer questions. Broad coverage across task types produces a general-purpose assistant.
The response demonstrates the desired behavior. For a summarization instruction, the response is a good summary. For a coding instruction, the response is working code. The model learns by example what constitutes an appropriate response to each type of instruction.
13.5.2 The objective
Instruction tuning uses the same cross-entropy loss as language modeling, but with a crucial modification: we only compute loss on the response tokens.
Given an instruction \(x = (x_1, \ldots, x_n)\) and response \(r = (r_1, \ldots, r_m)\), we concatenate them and feed the sequence through the model. The loss is:
\[ \mathcal{L}_{IT} = -\sum_{t=1}^{m} \log P(r_t | x_1, \ldots, x_n, r_1, \ldots, r_{t-1}) \]
Notice the sum runs only over response positions \(t = 1, \ldots, m\). We do not penalize the model for its predictions on instruction tokens. Why? The instruction is given by the user at inference time; the model does not need to generate it. We care only that the model produces good responses given instructions, not that it can predict instruction text.
Mathematically, this is implemented by masking the loss. We compute predictions for all positions but multiply losses by zero for instruction tokens:
\[ \mathcal{L}_{IT} = -\sum_{t=1}^{n+m} \text{mask}_t \cdot \log P(\text{token}_t | \text{token}_1, \ldots, \text{token}_{t-1}) \]
where \(\text{mask}_t = 0\) for instruction positions and \(\text{mask}_t = 1\) for response positions.
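A minimal sketch of this loss masking, assuming each example records the index where the response begins. Instruction positions receive the label -100, which cross-entropy ignores, so only response tokens contribute to the loss.

```python
import torch
import torch.nn.functional as F

def instruction_tuning_loss(logits, tokens, response_start):
    """Cross-entropy computed only on response tokens.

    logits: (batch, seq_len, vocab); tokens: (batch, seq_len)
    response_start: (batch,) index where the response begins in each row.
    """
    labels = tokens.clone()
    positions = torch.arange(tokens.size(1)).unsqueeze(0)    # (1, seq_len)
    labels[positions < response_start.unsqueeze(1)] = -100   # mask instruction tokens

    pred = logits[:, :-1, :]       # shift by one, as in causal LM training
    target = labels[:, 1:]
    return F.cross_entropy(pred.reshape(-1, pred.size(-1)),
                           target.reshape(-1),
                           ignore_index=-100)

logits = torch.randn(2, 12, 100)
tokens = torch.randint(0, 100, (2, 12))
loss = instruction_tuning_loss(logits, tokens, response_start=torch.tensor([5, 7]))
```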
13.5.3 Where does the data come from?
Instruction-tuning datasets are constructed through several approaches:
Human-written examples: Contractors or researchers write instructions and high-quality responses. This produces reliable data but is expensive and slow. Thousands of examples might cost tens of thousands of dollars.
Crowdsourcing: Platforms like Amazon Mechanical Turk collect instructions and responses from many workers. Quality varies, requiring filtering and validation. Larger scale is possible but noise increases.
Existing datasets reformatted: Many NLP datasets can be converted to instruction format. A sentiment classification dataset becomes “Classify the sentiment of this review as positive or negative: [review]” with the label as the response. A translation corpus becomes “Translate to French: [English text]”. This provides large-scale data cheaply but may not cover conversational or open-ended instructions well.
Synthetic data from larger models: A powerful model generates responses to instructions, and these are used to train smaller models. This is called distillation. The student model learns to imitate the teacher’s behavior. Quality depends on the teacher model, and there are legal and ethical considerations around using model outputs as training data.
User interactions: Deployed systems can collect real user instructions (with consent). This captures what users actually want, which may differ from what dataset creators imagine. Responses may come from human operators or be filtered from model outputs.
In practice, instruction-tuning datasets combine multiple sources. A dataset might include 10,000 human-written examples for quality, 100,000 reformatted NLP examples for coverage, and 50,000 synthetic examples for scale.
13.5.4 What changes during instruction tuning?
Pretraining produces a model that assigns probability to text. Instruction tuning reshapes this distribution. Before tuning, \(P(\text{"Paris"} | \text{"What is the capital of France?"})\) might be low because the model expects the text to continue as a document, not as an answer. After tuning, this probability increases because the model has seen thousands of question-answer pairs.
The model learns several things:
- Response format: Answers should be direct, not continuations of the question
- Task recognition: Different instruction patterns (translate, summarize, explain) require different response types
- Helpfulness: Responses should address what the user asked, not tangentially related content
- Refusal: Some instructions should be declined (harmful requests, impossible tasks)
The weights change throughout the network, but the changes are typically small relative to pretraining. We start from a capable language model and nudge it toward instruction-following behavior. This is why instruction tuning requires far less data than pretraining: we are refining existing capabilities, not building them from scratch.
13.5.5 Practical considerations
Learning rate: Instruction tuning uses smaller learning rates than pretraining, often \(10^{-5}\) to \(10^{-6}\) compared to \(10^{-4}\) for pretraining. Large learning rates would destroy the knowledge acquired during pretraining.
Epochs: We typically train for 1-3 epochs over the instruction data. More epochs risk overfitting to the specific phrasing of training instructions.
Parameter-efficient fine-tuning: Instead of updating all weights, methods like LoRA (Low-Rank Adaptation) freeze most parameters and train small adapter modules. This reduces memory requirements and preserves more of the pretrained knowledge. The adapters learn the instruction-following behavior while the base model remains unchanged.
Chat format: Many instruction-tuned models use a specific format with special tokens marking roles:
<|user|>What is the capital of France?<|assistant|>The capital of France is Paris.
The model learns to generate text following <|assistant|> given context containing <|user|> messages. Multi-turn conversations extend this with alternating user and assistant blocks.
13.6 Fine-tuning
Fine-tuning adapts a pretrained model to a specific task, domain, or behavior. Rather than training from scratch, we start with weights that already encode general language understanding and adjust them for our particular needs. This is one of the most practically important techniques in modern NLP.
13.6.1 Why fine-tuning works
A model pretrained on billions of tokens learns far more than next-token prediction. It learns:
- Syntax: How words combine into grammatical sentences
- Semantics: What words mean and how meanings compose
- World knowledge: Facts about entities, relationships, common sense
- Reasoning patterns: How to draw inferences, follow logic
- Discourse structure: How paragraphs and arguments flow
These capabilities transfer to new tasks. A model that understands English grammar does not need to relearn grammar for sentiment classification. A model that knows facts about the world can answer questions about those facts with minimal additional training.
The mathematics of transfer learning: pretraining finds weights \(\theta_{pre}\) that minimize loss on a large, general corpus. These weights define a point in parameter space. Fine-tuning starts from \(\theta_{pre}\) and moves to nearby weights \(\theta_{fine}\) that minimize loss on a smaller, task-specific dataset. Because \(\theta_{pre}\) already encodes useful structure, the optimization problem is easier: we are refining, not building from scratch.
Empirically, fine-tuning requires orders of magnitude less data than pretraining: a model pretrained on 1 trillion tokens might fine-tune effectively on 10,000 examples. This dramatic efficiency gain is why fine-tuning dominates practical applications.
13.6.2 The fine-tuning objective
Fine-tuning uses the same loss function as pretraining, just on different data:
\[ \mathcal{L}_{fine} = -\sum_{(x,y) \in \mathcal{D}_{task}} \log P_\theta(y | x) \]
For classification tasks, \(y\) is a class label and \(P_\theta(y|x)\) comes from a classification head added to the model. For generation tasks, \(y\) is a target sequence and we use the standard language modeling loss over output tokens.
The key differences from pretraining:
Smaller learning rate: We use learning rates 10-100x smaller than pretraining, typically \(10^{-5}\) to \(10^{-6}\). Large learning rates would destroy the pretrained knowledge. We want gentle updates that preserve most of what was learned while adjusting for the new task.
Fewer steps: Fine-tuning runs for thousands of steps, not millions. One to three passes over the fine-tuning data is typical. More risks overfitting to the small dataset.
Task-specific data: The fine-tuning dataset is much smaller but more focused. For sentiment classification, we might have 10,000 movie reviews with labels. For medical question answering, we might have 5,000 question-answer pairs from clinical sources.
13.6.3 Full fine-tuning
In full fine-tuning, we update the exact same weight matrices that were learned during pretraining. The model architecture does not change. We take the pretrained weights and adjust their numerical values.
Specifically, these are the weights being updated:
- Embedding matrix \(\mathbf{W}_E \in \mathbb{R}^{V \times d}\): Maps tokens to vectors
- Attention weights in each layer: \(\mathbf{W}_Q, \mathbf{W}_K, \mathbf{W}_V \in \mathbb{R}^{d \times d}\) and output projection \(\mathbf{W}_O\)
- MLP weights in each layer: \(\mathbf{W}_1 \in \mathbb{R}^{d \times 4d}\), \(\mathbf{W}_2 \in \mathbb{R}^{4d \times d}\)
- Layer normalization parameters: Scale and shift vectors
- Output projection \(\mathbf{W}_{out} \in \mathbb{R}^{d \times V}\): Maps back to vocabulary (often tied with \(\mathbf{W}_E\))
For a large transformer with \(L\) layers, this amounts to billions or tens of billions of individual numbers, each one adjustable during fine-tuning.
The update rule is standard gradient descent:
\[ \theta_{fine} = \theta_{pre} - \eta \nabla_\theta \mathcal{L}_{fine}(\theta_{pre}) \]
repeated for multiple steps with learning rate \(\eta\). We start from the pretrained values \(\theta_{pre}\) and nudge each weight in the direction that reduces the task-specific loss.
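A minimal sketch of a full fine-tuning loop under these settings (small learning rate, few steps). The names `model` and `task_loader` are placeholders, and `model(batch)` is assumed to return the scalar task loss.

```python
import torch

def fine_tune(model, task_loader, lr=1e-5, steps=1000):
    """Full fine-tuning: every parameter receives gradients.
    Assumes `model(batch)` returns a scalar loss (placeholder interface)."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)   # small lr: gentle updates
    model.train()
    for step, batch in enumerate(task_loader):
        if step >= steps:
            break
        loss = model(batch)                                    # task-specific loss
        optimizer.zero_grad()
        loss.backward()
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # common stabilizer
        optimizer.step()
```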
What changes? After fine-tuning, the embedding for “plaintiff” might shift slightly in the embedding space. The attention weights might learn to attend more strongly to legal terminology. The MLP might adjust how it processes formal language. Every weight can change, and millions of small changes accumulate into different model behavior.
What stays the same? The architecture: number of layers, attention heads, hidden dimensions, activation functions. We are not adding or removing anything, just adjusting the numbers.
Advantages: Maximum flexibility to adapt to the new task. If the task requires very different representations than pretraining, full fine-tuning can make large changes.
Disadvantages: Requires storing a complete copy of the model for each task. A 70B parameter model takes 140GB in 16-bit precision. Ten tasks means 1.4TB of storage. Also, full fine-tuning can overfit quickly on small datasets since all parameters are free to change.
13.6.4 Catastrophic forgetting
When we fine-tune on task A, the model may lose capabilities it had before fine-tuning. This is catastrophic forgetting: the new gradients overwrite information stored in the weights.
Consider a model pretrained on general text, then fine-tuned on legal documents. After fine-tuning, it might excel at legal language but struggle with casual conversation or code. The legal gradients pushed weights away from configurations that supported those other capabilities.
Mathematically, the pretrained weights \(\theta_{pre}\) lie in a region of parameter space that performs well on many tasks. Fine-tuning moves to \(\theta_{fine}\), optimized for the specific task but potentially outside the good region for other tasks.
Mitigation strategies:
Lower learning rate: Smaller updates stay closer to \(\theta_{pre}\), preserving more general capability.
Early stopping: Stop fine-tuning before the model overfits to the task data. Monitor performance on held-out general benchmarks.
Regularization toward pretrained weights: Add a penalty \(\lambda ||\theta - \theta_{pre}||^2\) to the loss. This explicitly discourages moving far from the pretrained configuration (a sketch of this penalty follows the list below).
Replay: Mix task-specific data with samples from the pretraining distribution. The model continues seeing general text while learning the specific task.
Parameter-efficient methods: Update only a small subset of parameters, leaving most frozen. This is the most effective approach and deserves detailed discussion.
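As noted above, the regularization-toward-pretrained-weights penalty is simple to implement. In this sketch, `pretrained_state` is assumed to be a copy of the weights saved before fine-tuning begins.

```python
import torch

def l2_to_pretrained_penalty(model, pretrained_state, lam=1e-4):
    """Penalty lambda * ||theta - theta_pre||^2, summed over trainable parameters.

    `pretrained_state` is a dict of frozen pretrained tensors, e.g. a copy of
    model.state_dict() taken before fine-tuning. Add the returned value to the
    task loss before calling backward().
    """
    penalty = 0.0
    for name, param in model.named_parameters():
        if param.requires_grad:
            penalty = penalty + (param - pretrained_state[name]).pow(2).sum()
    return lam * penalty

# Usage sketch: loss = task_loss + l2_to_pretrained_penalty(model, pretrained_state)
```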
13.6.5 Parameter-efficient fine-tuning
Instead of updating all parameters, we can freeze most of the pretrained model and train only a small number of additional or selected parameters. This family of techniques is called parameter-efficient fine-tuning (PEFT).
LoRA: Low-Rank Adaptation
LoRA is the most widely used PEFT method. The key insight: the weight updates during fine-tuning have low rank. We do not need full-rank updates to adapt a model.
The core idea: Instead of modifying the pretrained weight matrix \(\mathbf{W}\) directly, we add a small correction term to it. This correction is represented as the product of two new, smaller matrices that we create from scratch.
Consider a pretrained weight matrix \(\mathbf{W} \in \mathbb{R}^{d \times k}\). In full fine-tuning, we would update \(\mathbf{W}\) to \(\mathbf{W} + \Delta\mathbf{W}\), where \(\Delta\mathbf{W}\) is whatever change the gradients dictate. LoRA instead constrains \(\Delta\mathbf{W}\) to be low-rank by factorizing it:
\[ \Delta\mathbf{W} = \mathbf{B}\mathbf{A} \]
What are A and B? These are two brand-new matrices that do not exist in the pretrained model. We create them specifically for fine-tuning:
- \(\mathbf{A} \in \mathbb{R}^{r \times k}\): A “down-projection” matrix with \(r\) rows and \(k\) columns
- \(\mathbf{B} \in \mathbb{R}^{d \times r}\): An “up-projection” matrix with \(d\) rows and \(r\) columns
- \(r\) is the “rank,” a small number like 8, 16, or 64
The product \(\mathbf{B}\mathbf{A}\) has shape \(d \times k\), matching the original weight matrix \(\mathbf{W}\). But because we go through the bottleneck dimension \(r\), the product can only represent matrices of rank at most \(r\). This is the “low-rank” constraint.
Visualizing the dimensions: Suppose \(\mathbf{W}\) is a \(4096 \times 4096\) attention weight matrix and we use rank \(r = 16\):
| Matrix | Shape | Parameters | Status |
|---|---|---|---|
| Original \(\mathbf{W}\) | \(4096 \times 4096\) | 16.7 million | Frozen |
| \(\mathbf{A}\) | \(16 \times 4096\) | 65,536 | Trainable |
| \(\mathbf{B}\) | \(4096 \times 16\) | 65,536 | Trainable |
| \(\mathbf{B}\mathbf{A}\) | \(4096 \times 4096\) | (computed, not stored) | Low-rank update |
The input \(\mathbf{x}\) has dimension \(k = 4096\). The computation flows:
- \(\mathbf{A}\mathbf{x}\): Project from 4096 dimensions down to 16 dimensions
- \(\mathbf{B}(\mathbf{A}\mathbf{x})\): Project from 16 dimensions back up to 4096 dimensions
The bottleneck at 16 dimensions limits what transformations are possible, but this turns out to be enough for task adaptation.
The forward pass becomes:
\[ \mathbf{h} = \mathbf{W}\mathbf{x} + \mathbf{B}\mathbf{A}\mathbf{x} \]
The pretrained weights \(\mathbf{W}\) are frozen and never updated. Only \(\mathbf{A}\) and \(\mathbf{B}\) receive gradients and change during training. The original model is untouched; we are learning a small correction on top of it.
Initialization:
- \(\mathbf{A}\) is initialized with small random values (Gaussian, variance \(1/r\))
- \(\mathbf{B}\) is initialized to all zeros
Because \(\mathbf{B} = 0\) initially, the product \(\mathbf{B}\mathbf{A} = 0\), so \(\mathbf{h} = \mathbf{W}\mathbf{x}\) at the start. The model begins exactly at pretrained behavior. As training proceeds, \(\mathbf{B}\) grows away from zero, and the LoRA correction gradually takes effect.
Parameter count: For a weight matrix of size \(d \times k\), full fine-tuning has \(dk\) trainable parameters. LoRA with rank \(r\) has \(r \times k + d \times r = r(d + k)\) trainable parameters. With \(d = k = 4096\) and \(r = 16\):
- Full: \(4096 \times 4096 = 16.7M\) parameters per matrix
- LoRA: \(16 \times 4096 + 4096 \times 16 = 131K\) parameters per matrix
A 128x reduction per matrix. Across all attention matrices in a large model, LoRA typically adds 0.1-1% of the base model’s parameters.
Scaling factor: In practice, the LoRA update is scaled:
\[ \mathbf{h} = \mathbf{W}\mathbf{x} + \frac{\alpha}{r}\mathbf{B}\mathbf{A}\mathbf{x} \]
where \(\alpha\) is a hyperparameter (often set equal to \(r\) so the factor is 1, or tuned separately). This scaling helps with learning rate tuning across different ranks.
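Putting the pieces together, here is a sketch of a LoRA-wrapped linear layer with the initialization, scaling, and merging described above. It illustrates the equations and is not the API of any particular LoRA library.

```python
import math
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank correction:
    h = W x + (alpha / r) * B A x."""

    def __init__(self, base: nn.Linear, r: int = 16, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():       # freeze W (and bias, if any)
            p.requires_grad = False
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(torch.randn(r, k) / math.sqrt(r))  # small random init, variance 1/r
        self.B = nn.Parameter(torch.zeros(d, r))                 # zero init: starts as a no-op
        self.scale = alpha / r

    def forward(self, x):
        # base(x) computes W x; the second term is the low-rank correction (alpha/r) * B A x
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

    @torch.no_grad()
    def merge(self):
        """Fold B A into the frozen weight so inference has no overhead."""
        self.base.weight += self.scale * (self.B @ self.A)
        return self.base

layer = LoRALinear(nn.Linear(4096, 4096, bias=False), r=16)
h = layer(torch.randn(2, 4096))
```

After training, calling `merge()` produces an ordinary linear layer with the task-specific behavior baked in; keeping `A` and `B` separate instead allows swapping different adaptations onto the same frozen base.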
Which layers get LoRA? We choose which weight matrices in the model to augment with LoRA. Common choices:
- \(\mathbf{W}_Q\) and \(\mathbf{W}_V\) (query and value projections in attention): Most common, good results
- Adding \(\mathbf{W}_K\) and \(\mathbf{W}_O\) (key and output projections): More capacity
- Adding MLP weights: Even more capacity, but diminishing returns
Each matrix we apply LoRA to gets its own pair of \(\mathbf{A}\) and \(\mathbf{B}\) matrices. If we apply LoRA to \(\mathbf{W}_Q\) and \(\mathbf{W}_V\) in each of 32 layers, we create \(32 \times 2 = 64\) pairs of LoRA matrices.
After training: We can merge the LoRA weights back into the base model:
\[ \mathbf{W}_{merged} = \mathbf{W} + \mathbf{B}\mathbf{A} \]
Now we have a single weight matrix with no inference overhead. The model runs at normal speed with the task-specific behavior baked in. Or we can keep them separate, allowing us to swap different LoRA adaptations in and out of the same base model.
Why low-rank works: Fine-tuning for a specific task does not require changing everything the model knows. It requires adjusting how existing knowledge is accessed and combined. These adjustments lie in a low-dimensional subspace of the full parameter space.
Think of it this way: the pretrained model already knows about language, facts, and reasoning. To adapt it to legal documents, we do not need to relearn English. We need to adjust attention patterns to focus on legal terminology and shift how the model weighs certain features. These adjustments are relatively simple transformations of what already exists, hence low-rank.
Empirically, rank 8-64 suffices for most tasks. This suggests the “task-specific adjustment space” is indeed low-dimensional, perhaps only a few dozen independent directions in the billion-dimensional parameter space.
Adapters
Adapters insert small trainable modules between frozen layers. A typical adapter has:
- Down-projection: \(\mathbf{W}_{down} \in \mathbb{R}^{r \times d}\) reducing dimension from \(d\) to \(r\)
- Nonlinearity: ReLU or GELU
- Up-projection: \(\mathbf{W}_{up} \in \mathbb{R}^{d \times r}\) restoring dimension
- Residual connection: add the adapter output to the input
\[ \text{Adapter}(\mathbf{x}) = \mathbf{x} + \mathbf{W}_{up} \cdot \text{ReLU}(\mathbf{W}_{down} \cdot \mathbf{x}) \]
Adapters are placed after attention and/or MLP sublayers. The residual connection means that with zero-initialized up-projection, the adapter initially has no effect.
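A sketch of such an adapter module, with the up-projection zero-initialized so the adapter starts as an identity map:

```python
import torch
import torch.nn as nn

class Adapter(nn.Module):
    """Bottleneck adapter: down-project, nonlinearity, up-project, residual.
    Inserted after an attention or MLP sublayer of an otherwise frozen transformer."""

    def __init__(self, d: int, r: int):
        super().__init__()
        self.down = nn.Linear(d, r)
        self.up = nn.Linear(r, d)
        nn.init.zeros_(self.up.weight)   # zero init: adapter output is 0 at the start
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(torch.relu(self.down(x)))   # residual connection

adapter = Adapter(d=768, r=64)
out = adapter(torch.randn(2, 10, 768))   # same shape as the input
```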
Comparison to LoRA: Adapters add sequential computation (extra matrix multiplies in the forward pass). LoRA modifies existing computations (the \(\mathbf{W} + \mathbf{B}\mathbf{A}\) can be merged at inference). LoRA is generally preferred for its inference efficiency: after training, merge \(\Delta\mathbf{W}\) into \(\mathbf{W}\) and the model runs at normal speed. Adapters always add overhead.
Prefix tuning
Prefix tuning prepends learnable “virtual tokens” to the input. Instead of modifying weights, we modify the input:
\[ \text{Input} = [\mathbf{p}_1, \mathbf{p}_2, \ldots, \mathbf{p}_m, \mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_n] \]
where \(\mathbf{p}_i\) are trainable prefix embeddings and \(\mathbf{x}_j\) are the actual input tokens. The prefix embeddings are learned per-task while all model weights remain frozen.
The prefix acts as a task-specific context that steers the model’s behavior. For summarization, the prefix might implicitly encode “summarize the following.” For translation, it might encode “translate to French.”
Soft prompts vs. hard prompts: Hard prompts are actual text tokens (“Summarize:”). Soft prompts are continuous vectors that do not correspond to real tokens. Soft prompts can be optimized directly, potentially finding better “prompts” than any expressible in natural language.
Parameter count: With prefix length \(m\) and embedding dimension \(d\), prefix tuning adds \(m \times d\) parameters (plus some overhead for deep prefix tuning that adds prefixes at every layer). This is very few parameters, but the method is less expressive than LoRA for some tasks.
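A sketch of the shallow variant (a soft prompt prepended at the embedding layer only); the prefix length and model dimension below are illustrative.

```python
import torch
import torch.nn as nn

class SoftPrompt(nn.Module):
    """Learnable prefix embeddings prepended to the token embeddings.
    All model weights stay frozen; only `prefix` is trained."""

    def __init__(self, prefix_len: int, d_model: int):
        super().__init__()
        self.prefix = nn.Parameter(torch.randn(prefix_len, d_model) * 0.02)

    def forward(self, token_embeddings):                  # (batch, seq_len, d_model)
        batch = token_embeddings.size(0)
        prefix = self.prefix.unsqueeze(0).expand(batch, -1, -1)
        return torch.cat([prefix, token_embeddings], dim=1)

soft_prompt = SoftPrompt(prefix_len=20, d_model=768)
x = torch.randn(4, 32, 768)        # embeddings of the real tokens
extended = soft_prompt(x)          # (4, 52, 768): 20 virtual tokens + 32 real tokens
```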
13.6.6 Choosing a fine-tuning approach
When should you use each approach?
Full fine-tuning when:
- You have abundant task-specific data (100K+ examples)
- The task is very different from pretraining (e.g., a new language)
- You only need one task and can afford the storage
- Maximum performance matters more than efficiency

LoRA when:
- You have moderate data (1K-100K examples)
- You need multiple task-specific versions of the same base model
- You want to preserve general capabilities while adding task skills
- Inference efficiency matters (merged weights have no overhead)

Adapters when:
- You want modular, swappable task capabilities
- You are experimenting with many tasks and want easy comparison
- The overhead of adapter forward passes is acceptable

Prefix tuning when:
- You have very little data (100-1000 examples)
- The task is a variation of something the model already does well
- You want the smallest possible parameter footprint

No fine-tuning (prompting only) when:
- You have essentially no task-specific data
- The base model already performs adequately with good prompts
- You need zero additional training or storage
13.6.7 The fine-tuning landscape
Modern LLM deployment typically involves multiple stages of fine-tuning:
- Pretraining: Massive compute, general text, language modeling objective
- Continued pretraining (optional): More compute on domain-specific text (medical, legal, code)
- Instruction tuning: Moderate compute, instruction-following data
- Task-specific fine-tuning: Smaller compute, specific application data
- RLHF/DPO: Alignment with human preferences
Each stage can use full fine-tuning or PEFT methods. A common pattern:
- Full fine-tuning for instruction tuning (significant behavior change needed)
- LoRA for task-specific adaptation (preserve instruction-following while adding task skill)
The pretrained model is a foundation. Each fine-tuning stage builds on it, specializing capabilities while (hopefully) preserving general competence. Understanding this landscape helps practitioners choose where and how to intervene for their specific needs.
13.7 Reinforcement learning from human feedback
RLHF further aligns models with human preferences using reinforcement learning (Ouyang et al. 2022). The core problem is this: language modeling teaches a model to predict text, but predicting text is not the same as being helpful. A model trained purely on web text will happily generate toxic content, confidently state falsehoods, or ignore the user’s actual intent. RLHF incorporates human judgment about what makes a response good, beyond statistical likelihood.
The process has three stages, each with different models being trained:
Collect preferences: Sample responses from the current LLM, have humans compare them. No training happens here.
Train reward model: Create a separate model (initialized from the same pretrained weights but with a scalar output head). Train this reward model on the preference data. The LLM weights are frozen during this phase.
Optimize policy: Now train the LLM to produce high-reward responses. The reward model weights are frozen; only the LLM weights update.
The key point: the reward model and the language model are separate networks. We never train them simultaneously. The reward model learns to score responses; then, with the reward model fixed, we update the language model to generate better responses according to those scores.
13.7.1 Collecting preference data
We start with a pretrained and instruction-tuned language model. Given a prompt \(x\), we sample multiple responses \(y_1, y_2, \ldots, y_k\) from the model. Human annotators then compare pairs of responses and indicate which is better. For a prompt like “Explain quantum entanglement to a child”, one response might use clear analogies while another might use jargon. The annotator marks the clearer one as preferred.
This comparison format is crucial. Asking humans to score responses on an absolute scale (1-10) produces inconsistent ratings. Different annotators calibrate differently, and even the same annotator varies over time. But comparisons are robust: given two responses side by side, humans reliably identify which is better, even when they cannot articulate exactly why.
The result is a dataset of triples \((x, y_w, y_l)\): prompt, winning response, losing response. Thousands of such comparisons capture nuanced human preferences about helpfulness, accuracy, safety, and style.
13.7.2 The reward model
We train a reward model \(r_\phi(x, y)\) that takes a prompt and response and outputs a scalar score. This is a separate network from the language model we ultimately want to improve. Architecturally, the reward model is typically a transformer initialized from the same pretrained weights as the language model, but we remove the language modeling head (which outputs vocabulary probabilities) and add a linear layer that outputs a single number. The final token’s representation passes through this layer to produce the reward.
During this phase, we only update the reward model parameters \(\phi\). The language model that generated the responses remains frozen. We are not yet improving the language model; we are building a tool (the reward model) that will guide improvements in the next phase.
The reward model learns from preference comparisons using the Bradley-Terry model. This statistical model, developed in the 1950s for ranking chess players and sports teams, provides a principled way to convert pairwise comparisons into numerical scores.
13.7.3 The Bradley-Terry model
The core idea is simple: each item has an underlying “strength” and the probability of one item beating another depends on their relative strengths. In the original formulation, if item \(i\) has strength \(p_i > 0\) and item \(j\) has strength \(p_j > 0\), then the probability that \(i\) beats \(j\) is:
\[ P(i \succ j) = \frac{p_i}{p_i + p_j} \]
This formula has intuitive properties. If \(p_i = p_j\), the probability is \(\frac{1}{2}\): equally matched items have equal chances. If \(p_i = 2p_j\), then \(P(i \succ j) = \frac{2}{3}\): item \(i\) is twice as “strong” and wins two-thirds of the time. As \(p_i \to \infty\) relative to \(p_j\), the probability approaches 1.
We can reparametrize by taking logarithms. Let \(r_i = \log p_i\). Then:
\[ P(i \succ j) = \frac{p_i}{p_i + p_j} = \frac{e^{r_i}}{e^{r_i} + e^{r_j}} = \frac{1}{1 + e^{-(r_i - r_j)}} = \sigma(r_i - r_j) \]
The sigmoid function \(\sigma(z) = \frac{1}{1 + e^{-z}}\) emerges naturally. The probability of winning depends only on the difference in log-strengths \(r_i - r_j\). This is why we call these log-strengths “rewards” in RLHF: they are scores on a scale where differences determine win probabilities.
Figure 13.1 shows this relationship. The sigmoid curve transforms any score difference into a probability between 0 and 1. The curve is steepest around zero, meaning small score differences produce noticeable probability changes. Far from zero, the curve flattens: whether one response scores 10 points higher or 100 points higher, it will win with near-certainty either way.
A concrete example: Suppose for a given prompt, response A has reward \(r_A = 1.5\) and response B has reward \(r_B = -0.5\). The difference is \(r_A - r_B = 2.0\). The Bradley-Terry model predicts:
\[ P(A \succ B) = \sigma(2.0) = \frac{1}{1 + e^{-2}} \approx 0.88 \]
Response A wins 88% of the time. If we collected many comparisons between these responses, we would expect A to be preferred in roughly 88 out of 100 comparisons.
13.7.4 Training the reward model
For RLHF, we apply Bradley-Terry to response pairs. If response \(y_w\) is preferred over \(y_l\) for prompt \(x\), the model should satisfy:
\[ P(y_w \succ y_l | x) = \sigma(r_\phi(x, y_w) - r_\phi(x, y_l)) \]
We train by maximizing the log-likelihood of observed preferences:
\[ \mathcal{L}_{RM} = -\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma(r_\phi(x, y_w) - r_\phi(x, y_l)) \right] \]
Expanding the logarithm of the sigmoid:
\[ \log \sigma(z) = \log \frac{1}{1 + e^{-z}} = -\log(1 + e^{-z}) \]
When \(z\) is large and positive (winner has much higher reward), \(e^{-z} \approx 0\), so \(\log \sigma(z) \approx 0\): low loss. When \(z\) is negative (loser has higher reward), \(e^{-z}\) is large, and the loss grows. The gradient pushes the model to increase \(r_\phi(x, y_w)\) and decrease \(r_\phi(x, y_l)\).
Note that only the difference \(r_\phi(x, y_w) - r_\phi(x, y_l)\) matters. We can add any constant to all rewards without changing the loss. This makes the absolute scale of rewards arbitrary, which is fine since we only need to rank responses, not assign meaningful scores.
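The loss is a one-liner once the reward model has produced scalar scores for the two responses in each pair; `logsigmoid` is the numerically stable form of \(\log \sigma\). A sketch:

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_winner: torch.Tensor,
                      reward_loser: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss: -log sigmoid(r(x, y_w) - r(x, y_l)).

    Both inputs are (batch,) scalars produced by the reward model's
    scalar head for the winning and losing responses of each pair."""
    return -F.logsigmoid(reward_winner - reward_loser).mean()

# Example: three preference pairs with made-up scores
r_w = torch.tensor([1.5, 0.2, -0.1])
r_l = torch.tensor([-0.5, 0.4, -1.0])
loss = reward_model_loss(r_w, r_l)
```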
What does the reward model learn? It learns a compressed representation of human preferences. When annotators consistently prefer responses that are accurate, the reward model learns to score accurate responses higher. When they prefer concise answers for simple questions, the reward model captures this. The reward model distills thousands of individual judgments into a function that generalizes to new prompts and responses the annotators never saw.
Counterexample: Consider training a reward model on comparisons where annotators only care about response length, preferring shorter answers. The reward model would learn \(r_\phi(x, y) \approx -\text{len}(y)\), assigning higher scores to shorter responses regardless of content. This is technically correct given the training signal, but useless for alignment. The quality of RLHF depends entirely on what preferences the annotators express.
13.7.5 Policy optimization
Now we have a trained reward model that scores responses. In this final phase, we update the language model (called the policy, denoted \(\pi_\theta\)) to produce high-reward responses. The reward model parameters \(\phi\) are now frozen; we only update the language model parameters \(\theta\). This is a reinforcement learning problem: the model takes actions (generating tokens), receives rewards (from the reward model), and must learn a policy that maximizes expected reward.
Given prompt \(x\), the model generates response \(y\) by sampling tokens autoregressively: \(y_t \sim \pi_\theta(\cdot | x, y_{<t})\). The complete response receives reward \(r_\phi(x, y)\). We want to maximize:
\[ J(\theta) = \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta(\cdot|x)}[r_\phi(x, y)] \]
But maximizing reward alone is dangerous. The reward model is an imperfect proxy for human preferences. If we optimize too aggressively, the policy finds responses that score high according to the reward model but are actually low quality. This is called reward hacking. For example, the reward model might have learned that confident-sounding responses tend to be preferred. The policy could exploit this by generating responses that sound extremely confident regardless of accuracy.
To prevent reward hacking, we add a constraint: the policy should not deviate too far from the original model (the reference policy \(\pi_{ref}\), typically the instruction-tuned model before RLHF). We measure deviation using KL divergence:
\[ \text{KL}(\pi_\theta || \pi_{ref}) = \mathbb{E}_{y \sim \pi_\theta} \left[ \log \frac{\pi_\theta(y|x)}{\pi_{ref}(y|x)} \right] \]
The full RLHF objective becomes:
\[ J(\theta) = \mathbb{E}_{x \sim \mathcal{D}, y \sim \pi_\theta(\cdot|x)}\left[r_\phi(x, y) - \beta \cdot \log \frac{\pi_\theta(y|x)}{\pi_{ref}(y|x)}\right] \]
The hyperparameter \(\beta\) controls the trade-off. Large \(\beta\) keeps the policy close to the reference, limiting reward but preventing divergence. Small \(\beta\) allows more aggressive optimization, risking reward hacking. In practice, \(\beta\) is tuned empirically, often starting around 0.01-0.1.
We can rewrite this as maximizing a modified reward:
\[ \tilde{r}(x, y) = r_\phi(x, y) - \beta \cdot \log \frac{\pi_\theta(y|x)}{\pi_{ref}(y|x)} \]
The KL term acts as a penalty on responses that the reference model would find unlikely. If \(\pi_\theta(y|x) \gg \pi_{ref}(y|x)\), the log ratio is large and positive, reducing the effective reward. This prevents the policy from drifting into regions of response space where the reference model assigns low probability, which are often degenerate or exploitative.
13.7.6 Proximal policy optimization
We optimize this objective using Proximal Policy Optimization (PPO), a reinforcement learning algorithm designed for stable training. The challenge is that the objective involves an expectation over samples from the current policy. As we update \(\theta\), the sampling distribution changes, which can cause instability.
PPO addresses this by limiting how much the policy can change in each update. Let \(r_t(\theta) = \frac{\pi_\theta(y_t|x, y_{<t})}{\pi_{\theta_{old}}(y_t|x, y_{<t})}\) be the probability ratio between new and old policies for token \(t\). PPO clips this ratio:
\[ \mathcal{L}_{PPO} = -\mathbb{E}\left[\min\left(r_t(\theta) \hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t\right)\right] \]
where \(\hat{A}_t\) is the advantage estimate (how much better this action was than expected) and \(\epsilon\) is a small constant like 0.2. The clipping prevents the ratio from moving too far from 1, ensuring gradual policy updates.
Computing advantages requires estimating value functions and handling the credit assignment problem: which tokens in the response contributed to the final reward? This involves training a value network alongside the policy, adding complexity. Full PPO implementations for RLHF are intricate, but the core idea is simple: take small steps, clip large changes, and gradually improve the policy.
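A sketch of the clipped surrogate loss itself, taking token log-probabilities and advantage estimates as given; the value network and advantage estimation are the parts deliberately omitted here.

```python
import torch

def ppo_clipped_loss(logprobs_new: torch.Tensor,
                     logprobs_old: torch.Tensor,
                     advantages: torch.Tensor,
                     eps: float = 0.2) -> torch.Tensor:
    """Per-token clipped surrogate objective (negated for minimization).

    All inputs are (batch, response_len) tensors: log pi_theta(y_t | ...),
    log pi_theta_old(y_t | ...), and advantage estimates A_hat_t."""
    ratio = torch.exp(logprobs_new - logprobs_old)            # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    return -torch.min(unclipped, clipped).mean()
```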
13.7.7 Direct preference optimization
An alternative called Direct Preference Optimization (DPO) (Rafailov et al. 2023) sidesteps reinforcement learning entirely. The insight is that the optimal policy under the RLHF objective has a closed form:
\[ \pi^*(y|x) = \frac{1}{Z(x)} \pi_{ref}(y|x) \exp\left(\frac{1}{\beta} r_\phi(x, y)\right) \]
where \(Z(x)\) is a normalizing constant. Rearranging, the reward can be expressed in terms of policies:
\[ r_\phi(x, y) = \beta \log \frac{\pi^*(y|x)}{\pi_{ref}(y|x)} + \beta \log Z(x) \]
Substituting into the Bradley-Terry preference model and simplifying (the \(Z(x)\) terms cancel), we get a loss that directly optimizes the policy on preference data:
\[ \mathcal{L}_{DPO} = -\mathbb{E}_{(x, y_w, y_l)} \left[ \log \sigma\left(\beta \log \frac{\pi_\theta(y_w|x)}{\pi_{ref}(y_w|x)} - \beta \log \frac{\pi_\theta(y_l|x)}{\pi_{ref}(y_l|x)}\right) \right] \]
This looks complex but is straightforward to implement: compute log-probabilities under the current policy and reference policy for both responses, combine them as shown, and backpropagate. No reward model, no RL, no value functions. DPO achieves comparable results to PPO-based RLHF with simpler training.
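A sketch of the DPO loss from summed sequence log-probabilities under the current policy and the frozen reference model; the numbers in the example are made up.

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """DPO loss from sequence log-probabilities.

    logp_w / logp_l:         log pi_theta(y_w|x), log pi_theta(y_l|x), shape (batch,)
    ref_logp_w / ref_logp_l: the same quantities under the frozen reference model.
    Each is the sum of token log-probabilities over the response."""
    logits = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(logits).mean()

# Example with made-up log-probabilities for two preference pairs
loss = dpo_loss(torch.tensor([-12.3, -40.1]), torch.tensor([-15.0, -38.2]),
                torch.tensor([-13.0, -41.0]), torch.tensor([-14.2, -39.5]))
```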
13.7.8 Why RLHF matters
Language modeling optimizes \(P(\text{text})\), the probability of text appearing in the training corpus. But we want models that are helpful, accurate, and safe. These properties correlate with probability (helpful text exists in training data) but are not the same. RLHF directly optimizes for human preferences, bridging the gap between “likely text” and “good text”.
The reward model captures preferences that would be difficult to specify explicitly. How do you write down a loss function for “explains concepts clearly” or “acknowledges uncertainty appropriately”? You cannot, but you can collect examples where humans prefer one response over another, and the reward model learns the pattern.
RLHF is not without limitations. The preferences come from a specific group of annotators who may not represent all users. The reward model can be wrong. Optimization can find loopholes. But RLHF has proven essential for making language models useful in practice, transforming them from text predictors into assistants that follow instructions and avoid harmful outputs.
13.8 Comparing objectives
| Objective | Architecture | Context | Primary use |
|---|---|---|---|
| Causal LM | Decoder | Unidirectional | Text generation |
| Masked LM | Encoder | Bidirectional | Understanding tasks |
| Instruction tuning | Decoder | Unidirectional | Following instructions |
| RLHF | Decoder | Unidirectional | Alignment with preferences |
Each objective shapes what the model learns. Causal LM excels at generation. MLM excels at understanding. Instruction tuning and RLHF shape behavior and safety. Modern systems often combine these: pretrain with language modeling, then fine-tune with instructions, then align with RLHF.
13.9 Mathematical details
13.9.1 Cross-entropy loss
All the training objectives in this chapter share a common foundation: cross-entropy loss. Understanding cross-entropy requires stepping back to information theory.
Entropy: measuring uncertainty
Entropy quantifies uncertainty. If you flip a fair coin, there are two equally likely outcomes. The entropy is:
\[ H = -\sum_i p_i \log_2 p_i = -\frac{1}{2}\log_2 \frac{1}{2} - \frac{1}{2}\log_2 \frac{1}{2} = 1 \text{ bit} \]
One bit of entropy means you need one yes/no question to determine the outcome (“Was it heads?”). If the coin is biased with \(P(\text{heads}) = 0.99\), the entropy is much lower:
\[ H = -0.99 \log_2 0.99 - 0.01 \log_2 0.01 \approx 0.08 \text{ bits} \]
A nearly-certain outcome has low entropy. You barely need any information to predict it.
For a probability distribution \(p\) over \(V\) outcomes:
\[ H(p) = -\sum_{i=1}^{V} p_i \log p_i \]
By convention, \(0 \log 0 = 0\). Entropy is maximized when all outcomes are equally likely (maximum uncertainty) and minimized when one outcome has probability 1 (no uncertainty).
Cross-entropy: comparing distributions
Cross-entropy measures how well one distribution \(q\) predicts samples from another distribution \(p\). If the true distribution is \(p\) and we use distribution \(q\) to encode outcomes:
\[ H(p, q) = -\sum_{i=1}^{V} p_i \log q_i \]
This is the average number of bits needed to encode outcomes from \(p\) using a code optimized for \(q\). If \(q = p\), cross-entropy equals entropy: \(H(p, p) = H(p)\). If \(q \neq p\), cross-entropy is larger. Using the wrong distribution wastes bits.
The difference \(H(p, q) - H(p)\) is called the Kullback-Leibler divergence, often written \(D_{KL}(p || q)\). It measures how much worse \(q\) is compared to the optimal code. KL divergence is always non-negative and equals zero only when \(p = q\).
Cross-entropy for classification
In machine learning, we typically have a true label (one correct answer) rather than a full distribution. If the correct class is \(y\), the true distribution is:
\[ p_i = \begin{cases} 1 & \text{if } i = y \\ 0 & \text{otherwise} \end{cases} \]
This is a “one-hot” distribution: all probability mass on one outcome. The cross-entropy simplifies:
\[ H(p, q) = -\sum_{i} p_i \log q_i = -1 \cdot \log q_y - \sum_{i \neq y} 0 \cdot \log q_i = -\log q_y \]
Only the predicted probability of the correct class matters. If the model assigns \(q_y = 0.9\) to the correct answer, the loss is \(-\log 0.9 \approx 0.105\). If the model assigns \(q_y = 0.1\), the loss is \(-\log 0.1 \approx 2.303\). Lower probability for the correct answer means higher loss.
The logarithm’s role
Why use \(-\log\) rather than just \(1 - q_y\) or \((1 - q_y)^2\)? Several reasons:
Gradient behavior: When the model is very wrong (\(q_y \approx 0\)), the gradient of \(-\log q_y\) is large, providing strong learning signal. The gradient of \(1 - q_y\) would be small regardless of how wrong the prediction is.
Maximum likelihood: Minimizing cross-entropy is equivalent to maximizing likelihood. If we observe tokens \(y_1, y_2, \ldots, y_T\), the likelihood is \(\prod_t q_{y_t}\). The log-likelihood is \(\sum_t \log q_{y_t}\). Maximizing this equals minimizing \(-\sum_t \log q_{y_t}\), the sum of cross-entropies.
Information-theoretic meaning: The loss \(-\log q_y\) is the number of bits (or nats, if using natural log) needed to encode the outcome \(y\) using the model’s distribution. Good models use fewer bits.
A concrete example
Suppose the model predicts the next token in “The capital of France is ___”. The vocabulary has 50,000 tokens. The model outputs probabilities:
| Token | Probability |
|---|---|
| Paris | 0.72 |
| Lyon | 0.08 |
| Berlin | 0.03 |
| France | 0.02 |
| … | … |
| (all others) | 0.15 total |
If the true next token is “Paris”, the cross-entropy loss is:
\[ \mathcal{L} = -\log 0.72 \approx 0.329 \]
If the model had assigned only 0.01 probability to “Paris”:
\[ \mathcal{L} = -\log 0.01 \approx 4.605 \]
The second case has 14 times higher loss. The model made a confident wrong prediction, and cross-entropy penalizes this severely.
From logits to probabilities: softmax
Neural networks don’t directly output probabilities. They output logits: unbounded real numbers \(z_1, z_2, \ldots, z_V\), one per vocabulary item. We convert logits to probabilities using the softmax function:
\[ q_i = \text{softmax}(z)_i = \frac{\exp(z_i)}{\sum_{j=1}^{V} \exp(z_j)} \]
Softmax ensures all outputs are positive and sum to 1. Larger logits produce larger probabilities. The exponential amplifies differences: if \(z_i\) is much larger than other logits, \(q_i\) approaches 1.
The cross-entropy loss in terms of logits is:
\[ \mathcal{L}_{CE} = -\log q_y = -\log \frac{\exp(z_y)}{\sum_j \exp(z_j)} = -z_y + \log \sum_j \exp(z_j) \]
This is called the “log-sum-exp” form. The first term \(-z_y\) rewards high logit for the correct class. The second term \(\log \sum_j \exp(z_j)\) penalizes large logits for all classes, preventing the model from making all logits arbitrarily large.
The gradient
The gradient of cross-entropy loss with respect to logits has a remarkably simple form:
\[ \frac{\partial \mathcal{L}_{CE}}{\partial z_i} = q_i - \mathbb{1}_{i=y} \]
where \(\mathbb{1}_{i=y}\) is 1 if \(i = y\) (the correct class) and 0 otherwise.
For the correct class \(y\): the gradient is \(q_y - 1\). Since \(q_y < 1\), this is negative. Gradient descent subtracts the gradient, so \(z_y\) increases, making the correct answer more likely.
For incorrect classes \(i \neq y\): the gradient is \(q_i - 0 = q_i\). This is positive. Gradient descent decreases \(z_i\), making wrong answers less likely.
The magnitude matters too. If the model assigns \(q_i = 0.4\) to a wrong answer, the gradient pushes down with strength 0.4. If \(q_i = 0.01\), the push is weak. The model focuses on fixing its biggest mistakes.
Deriving the gradient: Let’s verify this. The loss is:
\[ \mathcal{L} = -z_y + \log \sum_j \exp(z_j) \]
For the first term: \(\frac{\partial(-z_y)}{\partial z_i} = -\mathbb{1}_{i=y}\)
For the second term, using the chain rule:
\[ \frac{\partial}{\partial z_i} \log \sum_j \exp(z_j) = \frac{\exp(z_i)}{\sum_j \exp(z_j)} = q_i \]
Combining: \(\frac{\partial \mathcal{L}}{\partial z_i} = q_i - \mathbb{1}_{i=y}\)
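A quick numerical check of this result using autograd on a five-class toy example:

```python
import torch
import torch.nn.functional as F

# Check that d(cross-entropy)/d(logits) = softmax(logits) - one_hot(y).
logits = torch.randn(5, requires_grad=True)    # 5-class toy example
y = torch.tensor(2)                            # correct class

loss = F.cross_entropy(logits.unsqueeze(0), y.unsqueeze(0))
loss.backward()

expected = F.softmax(logits.detach(), dim=-1)
expected[y] -= 1.0                             # subtract the one-hot indicator
print(torch.allclose(logits.grad, expected))   # True
```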
Numerical stability
Computing \(\exp(z_i)\) for large \(z_i\) causes overflow. Computing \(\log(q_y)\) for small \(q_y\) causes underflow. In practice, we use numerically stable implementations.
For softmax, subtract the maximum logit before exponentiating:
\[ q_i = \frac{\exp(z_i - \max_j z_j)}{\sum_j \exp(z_j - \max_j z_j)} \]
This gives the same result (the constant cancels) but keeps values in a reasonable range.
For log-softmax, compute directly:
\[ \log q_i = z_i - \log \sum_j \exp(z_j) \]
The log-sum-exp has a stable form: \(\log \sum_j \exp(z_j) = m + \log \sum_j \exp(z_j - m)\) where \(m = \max_j z_j\).
Deep learning frameworks provide fused log_softmax and cross_entropy functions that handle these details automatically. Always use these rather than computing softmax and log separately.
Why cross-entropy works for language models
A language model predicts the next token at each position. With vocabulary size \(V = 50,000\) and sequence length \(T = 1024\), each training example involves 1024 classification problems, each over 50,000 classes.
Cross-entropy provides several advantages:
Simple targets, rich gradients: Only the correct token’s probability enters the loss value, yet the gradient (softmax minus one-hot) updates every logit, so each position provides a full \(V\)-way learning signal at low computational cost.
Calibrated probabilities: Minimizing cross-entropy encourages the model to output well-calibrated probabilities, not just correct rankings. If the model says 80% confidence, it should be right about 80% of the time.
Additive over positions: The total loss is the sum of per-position losses. This decomposes naturally over sequence positions and batches.
Interpretable: The loss in nats (natural log) or bits (log base 2) has meaning. Perplexity \(= \exp(\mathcal{L})\) measures how many tokens the model is “choosing between” on average.
13.9.2 Label smoothing
To prevent overconfidence, we often use label smoothing. Instead of targeting probability 1 for the correct token:
\[ q_j = \begin{cases} 1 - \epsilon + \epsilon/V & \text{if } j = y \\ \epsilon/V & \text{otherwise} \end{cases} \]
where \(\epsilon\) is the smoothing parameter (typically 0.1) and \(V\) is vocabulary size. This distributes small probability mass to all tokens, encouraging the model to remain slightly uncertain.
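PyTorch’s cross_entropy accepts a label_smoothing argument. The sketch below checks it against the definition above on a toy batch; the agreement check assumes the library uses the same mixture-with-uniform formulation, which it should.

```python
import torch
import torch.nn.functional as F

logits = torch.randn(4, 10)                  # batch of 4, vocab of 10
targets = torch.randint(0, 10, (4,))
eps, V = 0.1, 10

loss_builtin = F.cross_entropy(logits, targets, label_smoothing=eps)

# Manual version: smoothed target distribution dotted with -log q
log_q = F.log_softmax(logits, dim=-1)
smooth_targets = torch.full_like(log_q, eps / V)          # eps/V everywhere
smooth_targets[torch.arange(4), targets] += 1 - eps       # 1 - eps + eps/V on the true class
loss_manual = -(smooth_targets * log_q).sum(dim=-1).mean()

print(torch.allclose(loss_builtin, loss_manual, atol=1e-6))  # expected to print True
```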
13.9.3 Teacher forcing
During training, we use teacher forcing: the model receives true previous tokens as input, not its own predictions. This provides stable gradients but creates a train-test mismatch since at inference the model must use its own (potentially incorrect) predictions.
Scheduled sampling gradually transitions from teacher forcing to using model predictions during training, but this complicates training and is rarely used with transformers. The train-test mismatch is generally accepted as a reasonable trade-off for training stability.
The training objectives covered here transform the raw transformer architecture into capable language models. The choice of objective determines what the model learns and how it behaves. In the next chapter, we examine how these models scale with compute, data, and parameters.