5  Neural networks basics

Learning objectives

After completing this chapter, you will be able to:

  • Describe how a single neuron computes a weighted sum with nonlinear activation
  • Construct multilayer networks and trace information flow through them
  • Define common loss functions for regression and classification
  • Derive the backpropagation algorithm for computing gradients
  • Implement gradient descent to update network parameters

Neural networks are the foundation of modern machine learning. Before we can understand transformers, we need to master how neural networks work: how they represent functions, how information flows forward through them, and how they learn through backpropagation. This chapter develops these ideas from first principles, with the mathematical rigor we established in the prerequisites.

5.1 The single neuron

The simplest neural network is a single neuron. It takes \(n\) inputs \(x_1, x_2, \ldots, x_n\), multiplies each by a weight, adds them up with a bias term, and applies a nonlinear function:

\[ y = \sigma\left(\sum_{i=1}^n w_i x_i + b\right) = \sigma(\mathbf{w}^T\mathbf{x} + b) \]

where \(\mathbf{w} = [w_1, \ldots, w_n]^T\) are the weights, \(b\) is the bias, and \(\sigma\) is an activation function. Let’s unpack each component.

The linear part \(\mathbf{w}^T\mathbf{x} + b\) computes a weighted sum of inputs plus a constant offset. Geometrically, this defines a hyperplane in input space. In 2D, if \(\mathbf{x} = [x_1, x_2]^T\), then \(w_1 x_1 + w_2 x_2 + b = 0\) is a line. The weight vector \(\mathbf{w}\) is perpendicular to this line, and \(b\) shifts it away from the origin.

Consider a concrete example with \(\mathbf{w} = [2, 1]^T\) and \(b = -3\). The linear part computes \(2x_1 + x_2 - 3\). When is this positive? When \(x_2 > -2x_1 + 3\), i.e., above the line \(x_2 = -2x_1 + 3\). The neuron divides input space into two regions: one where the linear part is positive, one where it’s negative.

But if we only had the linear part, the neuron could only represent linear functions. Stacking linear functions gives more linear functions: if \(f(\mathbf{x}) = \mathbf{A}\mathbf{x}\) and \(g(\mathbf{y}) = \mathbf{B}\mathbf{y}\), then \(g(f(\mathbf{x})) = \mathbf{B}\mathbf{A}\mathbf{x}\), which is still linear. This is why we need the activation function \(\sigma\): it introduces nonlinearity.

Common activation functions include:

Sigmoid: \(\sigma(z) = \frac{1}{1 + e^{-z}}\). This squashes any input to the range \((0, 1)\). For large positive \(z\), \(\sigma(z) \approx 1\). For large negative \(z\), \(\sigma(z) \approx 0\). The transition is smooth, centered at \(z = 0\) where \(\sigma(0) = 0.5\).

ReLU (Rectified Linear Unit): \(\text{ReLU}(z) = \max(0, z)\). This is zero for negative inputs and the identity for positive inputs. It’s piecewise linear but not linear overall (the “kink” at zero breaks linearity). ReLU is computationally cheap and avoids the vanishing-gradient problem that plagues sigmoid when inputs saturate.

Tanh: \(\tanh(z) = \frac{e^z - e^{-z}}{e^z + e^{-z}}\). Like sigmoid but outputs in \((-1, 1)\) and is centered at zero.

What choices of \(\sigma\) would defeat this purpose? The identity function \(\sigma(z) = z\) is linear, so using it as an activation would make the entire network linear regardless of depth. Any affine function \(\sigma(z) = az + b\) has the same problem.
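
Here is a minimal NumPy sketch of these three activations (the function names are our own choice), evaluated at a few points so you can see the squashing behavior directly:

```python
import numpy as np

def sigmoid(z):
    # Squashes any real input into (0, 1); sigmoid(0) == 0.5.
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    # Zero for negative inputs, identity for positive inputs.
    return np.maximum(0.0, z)

def tanh(z):
    # Like sigmoid but outputs in (-1, 1), centered at zero.
    return np.tanh(z)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z))  # approx [0.119, 0.5, 0.881]
print(relu(z))     # [0., 0., 2.]
print(tanh(z))     # approx [-0.964, 0., 0.964]
```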

Let’s trace through a concrete neuron. Suppose we have inputs \(\mathbf{x} = [0.5, 0.8]^T\), weights \(\mathbf{w} = [0.4, 0.6]^T\), bias \(b = -0.5\), and we use sigmoid activation. The computation proceeds:

\[ z = \mathbf{w}^T\mathbf{x} + b = 0.4 \cdot 0.5 + 0.6 \cdot 0.8 - 0.5 = 0.2 + 0.48 - 0.5 = 0.18 \]

\[ y = \sigma(0.18) = \frac{1}{1 + e^{-0.18}} = \frac{1}{1 + 0.835} = \frac{1}{1.835} \approx 0.545 \]

The neuron outputs approximately 0.545, slightly above the midpoint of 0.5 because the weighted sum 0.18 is slightly positive.
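
As a quick check, here is a short NumPy sketch of this single-neuron computation; it should reproduce \(z = 0.18\) and \(y \approx 0.545\):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Single neuron: weighted sum of inputs plus bias, then sigmoid.
x = np.array([0.5, 0.8])
w = np.array([0.4, 0.6])
b = -0.5

z = w @ x + b    # 0.4*0.5 + 0.6*0.8 - 0.5 = 0.18
y = sigmoid(z)   # approx 0.545
print(z, y)
```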

5.2 Multilayer networks

A single neuron can only represent functions that are “nearly linear” (linear followed by a squashing function). To represent complex functions, we stack neurons into layers.

A feedforward neural network (also called a multilayer perceptron, or MLP) consists of:

  • An input layer: the raw input \(\mathbf{x} \in \mathbb{R}^{n_0}\)
  • One or more hidden layers: intermediate representations
  • An output layer: the final prediction \(\mathbf{y} \in \mathbb{R}^{n_L}\)

Let’s define the computation precisely. For a network with \(L\) layers, let \(\mathbf{h}^{(0)} = \mathbf{x}\) be the input. For each layer \(\ell = 1, \ldots, L\):

\[ \mathbf{z}^{(\ell)} = \mathbf{W}^{(\ell)} \mathbf{h}^{(\ell-1)} + \mathbf{b}^{(\ell)} \]

\[ \mathbf{h}^{(\ell)} = \sigma(\mathbf{z}^{(\ell)}) \]

where \(\mathbf{W}^{(\ell)} \in \mathbb{R}^{n_\ell \times n_{\ell-1}}\) is the weight matrix for layer \(\ell\), \(\mathbf{b}^{(\ell)} \in \mathbb{R}^{n_\ell}\) is the bias vector, and \(\sigma\) is applied element-wise. The final output is \(\mathbf{y} = \mathbf{h}^{(L)}\).

Let’s work through a concrete example. Consider a network with:

  • Input dimension: \(n_0 = 2\)
  • Hidden layer: \(n_1 = 3\) neurons with ReLU activation
  • Output layer: \(n_2 = 1\) neuron with sigmoid activation

The weight matrices have shapes \(\mathbf{W}^{(1)} \in \mathbb{R}^{3 \times 2}\) and \(\mathbf{W}^{(2)} \in \mathbb{R}^{1 \times 3}\). Suppose:

\[ \mathbf{W}^{(1)} = \begin{bmatrix} 0.2 & 0.4 \\ -0.5 & 0.3 \\ 0.1 & -0.2 \end{bmatrix}, \quad \mathbf{b}^{(1)} = \begin{bmatrix} 0.1 \\ -0.1 \\ 0.2 \end{bmatrix} \]

\[ \mathbf{W}^{(2)} = \begin{bmatrix} 0.6 & -0.4 & 0.5 \end{bmatrix}, \quad \mathbf{b}^{(2)} = \begin{bmatrix} -0.2 \end{bmatrix} \]

For input \(\mathbf{x} = [1.0, 0.5]^T\):

Layer 1 (hidden):

\[ \mathbf{z}^{(1)} = \mathbf{W}^{(1)}\mathbf{x} + \mathbf{b}^{(1)} = \begin{bmatrix} 0.2 \cdot 1.0 + 0.4 \cdot 0.5 \\ -0.5 \cdot 1.0 + 0.3 \cdot 0.5 \\ 0.1 \cdot 1.0 - 0.2 \cdot 0.5 \end{bmatrix} + \begin{bmatrix} 0.1 \\ -0.1 \\ 0.2 \end{bmatrix} = \begin{bmatrix} 0.4 \\ -0.35 \\ 0 \end{bmatrix} + \begin{bmatrix} 0.1 \\ -0.1 \\ 0.2 \end{bmatrix} = \begin{bmatrix} 0.5 \\ -0.45 \\ 0.2 \end{bmatrix} \]

\[ \mathbf{h}^{(1)} = \text{ReLU}(\mathbf{z}^{(1)}) = \begin{bmatrix} \max(0, 0.5) \\ \max(0, -0.45) \\ \max(0, 0.2) \end{bmatrix} = \begin{bmatrix} 0.5 \\ 0 \\ 0.2 \end{bmatrix} \]

Notice how the second neuron outputs 0 because its pre-activation was negative. ReLU “kills” that neuron for this input.

Layer 2 (output):

\[ z^{(2)} = \mathbf{W}^{(2)}\mathbf{h}^{(1)} + b^{(2)} = 0.6 \cdot 0.5 + (-0.4) \cdot 0 + 0.5 \cdot 0.2 - 0.2 = 0.3 + 0 + 0.1 - 0.2 = 0.2 \]

\[ y = \sigma(0.2) = \frac{1}{1 + e^{-0.2}} \approx 0.55 \]

The network maps input \([1.0, 0.5]^T\) to output \(\approx 0.55\).
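
The same forward pass can be written in a few lines of NumPy. This is only a sketch of the worked example above, with the weights hard-coded:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relu(z):
    return np.maximum(0.0, z)

# Parameters from the worked example.
W1 = np.array([[ 0.2,  0.4],
               [-0.5,  0.3],
               [ 0.1, -0.2]])
b1 = np.array([0.1, -0.1, 0.2])
W2 = np.array([[0.6, -0.4, 0.5]])
b2 = np.array([-0.2])

x = np.array([1.0, 0.5])

# Layer 1: affine transform followed by element-wise ReLU.
z1 = W1 @ x + b1   # [0.5, -0.45, 0.2]
h1 = relu(z1)      # [0.5, 0.0, 0.2]

# Layer 2: affine transform followed by sigmoid.
z2 = W2 @ h1 + b2  # [0.2]
y = sigmoid(z2)    # approx [0.55]
print(h1, y)
```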

5.2.1 Why depth matters

Why use multiple layers instead of one wide layer? The key insight is that depth enables hierarchical representations.

A single hidden layer with enough neurons can approximate any continuous function (this is the universal approximation theorem). But “enough neurons” might be exponentially many. Deep networks can represent certain functions much more efficiently.

Consider computing the parity function: output 1 if an odd number of inputs are 1, output 0 otherwise. With one hidden layer, you need exponentially many neurons (roughly \(2^{n-1}\)). With \(\log n\) layers, you need only \(O(n)\) neurons total, by computing parity hierarchically: first compute parity of pairs, then parity of those results, and so on.

More practically, deep networks learn hierarchical features. In image recognition, early layers learn edges, middle layers learn shapes, deep layers learn objects. In language, early layers learn character patterns, middle layers learn words and phrases, deep layers learn semantics. This hierarchical structure mirrors the structure of the data.

5.3 Loss functions

A neural network has parameters \(\boldsymbol{\theta}\) (all weights and biases). Given training data \(\{(\mathbf{x}^{(i)}, y^{(i)})\}_{i=1}^m\), we want to find parameters that make the network’s predictions close to the true outputs. We measure “closeness” with a loss function \(\mathcal{L}(\boldsymbol{\theta})\).

5.3.1 Mean squared error

For regression (predicting continuous values), a natural choice is mean squared error (MSE):

\[ \mathcal{L}_{\text{MSE}} = \frac{1}{m} \sum_{i=1}^m (f(\mathbf{x}^{(i)}; \boldsymbol{\theta}) - y^{(i)})^2 \]

where \(f(\mathbf{x}; \boldsymbol{\theta})\) is the network’s output. This penalizes predictions quadratically: being off by 2 is four times worse than being off by 1.

Let’s compute MSE for a simple example. Suppose we have three data points with true values \(y^{(1)} = 1, y^{(2)} = 0, y^{(3)} = 1\) and our network predicts \(\hat{y}^{(1)} = 0.8, \hat{y}^{(2)} = 0.3, \hat{y}^{(3)} = 0.9\). Then:

\[ \mathcal{L}_{\text{MSE}} = \frac{1}{3}[(0.8-1)^2 + (0.3-0)^2 + (0.9-1)^2] = \frac{1}{3}[0.04 + 0.09 + 0.01] = \frac{0.14}{3} \approx 0.047 \]
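
In code, this is a one-liner; the NumPy sketch below reproduces the value \(\approx 0.047\):

```python
import numpy as np

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.8, 0.3, 0.9])

# Mean of squared prediction errors: (0.04 + 0.09 + 0.01) / 3.
mse = np.mean((y_pred - y_true) ** 2)
print(mse)  # approx 0.047
```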

5.3.2 Cross-entropy loss

For classification, we use cross-entropy loss. Suppose we’re doing binary classification: the true label is \(y \in \{0, 1\}\) and the network outputs a probability \(\hat{y} = \sigma(z) \in (0, 1)\). The cross-entropy loss is:

\[ \mathcal{L}_{\text{CE}} = -\frac{1}{m} \sum_{i=1}^m \left[ y^{(i)} \log \hat{y}^{(i)} + (1 - y^{(i)}) \log(1 - \hat{y}^{(i)}) \right] \]

Why this formula? Recall from the prerequisites that cross-entropy measures how well a predicted distribution \(q\) matches a true distribution \(p\). Here, the true distribution puts all mass on the correct class. If \(y = 1\), the loss is \(-\log \hat{y}\): high loss if \(\hat{y}\) is small (confident wrong prediction), low loss if \(\hat{y}\) is large (confident correct prediction). If \(y = 0\), the loss is \(-\log(1 - \hat{y})\): high loss if \(\hat{y}\) is large, low loss if \(\hat{y}\) is small.

Concrete example: true labels \(y^{(1)} = 1, y^{(2)} = 0, y^{(3)} = 1\) and predictions \(\hat{y}^{(1)} = 0.9, \hat{y}^{(2)} = 0.2, \hat{y}^{(3)} = 0.8\).

\[ \mathcal{L}_{\text{CE}} = -\frac{1}{3}\left[\log(0.9) + \log(0.8) + \log(0.8)\right] = -\frac{1}{3}[-0.105 - 0.223 - 0.223] = \frac{0.551}{3} \approx 0.184 \]
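
A corresponding NumPy sketch, reproducing the value \(\approx 0.184\):

```python
import numpy as np

y_true = np.array([1.0, 0.0, 1.0])
y_pred = np.array([0.9, 0.2, 0.8])

# Binary cross-entropy averaged over the three examples.
ce = -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))
print(ce)  # approx 0.184
```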

For multiclass classification with \(K\) classes, the network outputs a probability distribution \(\hat{\mathbf{y}} = \text{softmax}(\mathbf{z})\) and the cross-entropy becomes:

\[ \mathcal{L}_{\text{CE}} = -\frac{1}{m} \sum_{i=1}^m \sum_{k=1}^K y_k^{(i)} \log \hat{y}_k^{(i)} \]

where \(y_k^{(i)} = 1\) if example \(i\) belongs to class \(k\), and 0 otherwise (one-hot encoding).
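
In the multiclass case the usual recipe is a softmax over the raw scores (logits), followed by the cross-entropy above. The sketch below uses made-up logits for a single example with \(K = 3\) classes, just to show the shape of the computation:

```python
import numpy as np

def softmax(z):
    # Subtract the max for numerical stability; the result sums to 1.
    e = np.exp(z - np.max(z))
    return e / e.sum()

# Hypothetical logits for one example; the true class is class 2 (zero-indexed).
logits = np.array([1.0, 0.5, 2.0])
y_onehot = np.array([0.0, 0.0, 1.0])

probs = softmax(logits)
ce = -np.sum(y_onehot * np.log(probs))  # equals -log(probs[2])
print(probs, ce)
```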

5.4 Backpropagation

We want to minimize the loss \(\mathcal{L}(\boldsymbol{\theta})\) by adjusting parameters \(\boldsymbol{\theta}\). Gradient-based optimization requires computing \(\nabla_{\boldsymbol{\theta}} \mathcal{L}\): how does the loss change when we change each parameter? For a network with millions of parameters, we need an efficient algorithm. This is backpropagation.

5.4.1 The computational graph perspective

A neural network computation can be viewed as a directed acyclic graph where:

  • Nodes represent intermediate values (inputs, activations, outputs, loss)
  • Edges represent operations (matrix multiply, add bias, apply activation)

For our two-layer network example:

\[ \mathbf{x} \to \mathbf{z}^{(1)} \to \mathbf{h}^{(1)} \to z^{(2)} \to y \to \mathcal{L} \]

Each arrow is a function. To find \(\frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(1)}}\), we need to trace how changes in \(\mathbf{W}^{(1)}\) propagate through the graph to affect \(\mathcal{L}\).

5.4.2 Forward and backward passes

Backpropagation has two phases:

  1. Forward pass: Compute all intermediate values from input to loss
  2. Backward pass: Compute all gradients from loss back to parameters

The backward pass uses the chain rule systematically. For each node, we compute “how much does the loss change if this node’s value changes?” Let’s define:

\[ \delta^{(\ell)} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(\ell)}} \]

This is the “error signal” at layer \(\ell\): how sensitive is the loss to the pre-activation values?

5.4.3 Deriving the backpropagation equations

Let’s derive backpropagation for our two-layer network with sigmoid output and MSE loss. For a single training example (we’ll drop the superscript \((i)\)), the loss is:

\[ \mathcal{L} = \frac{1}{2}(y - \hat{y})^2 \]

where \(\hat{y} = \sigma(z^{(2)})\) and we include the \(\frac{1}{2}\) to simplify derivatives.

Output layer gradient:

\[ \frac{\partial \mathcal{L}}{\partial z^{(2)}} = \frac{\partial \mathcal{L}}{\partial \hat{y}} \cdot \frac{\partial \hat{y}}{\partial z^{(2)}} \]

Let's evaluate each factor in turn. We have:

\[ \frac{\partial \mathcal{L}}{\partial \hat{y}} = \frac{\partial}{\partial \hat{y}}\left[\frac{1}{2}(y - \hat{y})^2\right] = -(y - \hat{y}) = \hat{y} - y \]

\[ \frac{\partial \hat{y}}{\partial z^{(2)}} = \frac{d\sigma}{dz}\bigg|_{z=z^{(2)}} = \sigma(z^{(2)})(1 - \sigma(z^{(2)})) = \hat{y}(1 - \hat{y}) \]

So:

\[ \delta^{(2)} = \frac{\partial \mathcal{L}}{\partial z^{(2)}} = (\hat{y} - y) \cdot \hat{y}(1 - \hat{y}) \]

Now we can compute the gradients for layer 2’s parameters:

\[ \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(2)}} = \frac{\partial \mathcal{L}}{\partial z^{(2)}} \cdot \frac{\partial z^{(2)}}{\partial \mathbf{W}^{(2)}} = \delta^{(2)} \cdot (\mathbf{h}^{(1)})^T \]

\[ \frac{\partial \mathcal{L}}{\partial b^{(2)}} = \delta^{(2)} \]

Hidden layer gradient:

To backpropagate to layer 1, we need:

\[ \frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(1)}} = \left(\frac{\partial z^{(2)}}{\partial \mathbf{h}^{(1)}}\right)^T \frac{\partial \mathcal{L}}{\partial z^{(2)}} = (\mathbf{W}^{(2)})^T \delta^{(2)} \]

This is a key insight: the error signal at a layer equals the error signal from the next layer, multiplied by the weights connecting them. The weights “distribute” the error backwards.

Then:

\[ \delta^{(1)} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}^{(1)}} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(1)}} \odot \frac{\partial \mathbf{h}^{(1)}}{\partial \mathbf{z}^{(1)}} \]

where \(\odot\) is element-wise multiplication. For ReLU:

\[ \frac{\partial h_j^{(1)}}{\partial z_j^{(1)}} = \begin{cases} 1 & \text{if } z_j^{(1)} > 0 \\ 0 & \text{if } z_j^{(1)} \leq 0 \end{cases} \]

So \(\delta^{(1)} = (\mathbf{W}^{(2)})^T \delta^{(2)} \odot \mathbf{1}_{z^{(1)} > 0}\), where \(\mathbf{1}_{z^{(1)} > 0}\) is 1 where \(z^{(1)}\) is positive and 0 otherwise.

Finally:

\[ \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(1)}} = \delta^{(1)} \cdot \mathbf{x}^T \]

\[ \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(1)}} = \delta^{(1)} \]

5.4.4 Concrete backpropagation example

Let’s compute gradients for our earlier forward pass example. We had:

  • Input: \(\mathbf{x} = [1.0, 0.5]^T\)
  • Hidden activations: \(\mathbf{z}^{(1)} = [0.5, -0.45, 0.2]^T\), \(\mathbf{h}^{(1)} = [0.5, 0, 0.2]^T\)
  • Output: \(z^{(2)} = 0.2\), \(\hat{y} = \sigma(0.2) \approx 0.55\)

Suppose the true label is \(y = 1\) (so we want the output to be higher).

Step 1: Output layer error

\[ \delta^{(2)} = (\hat{y} - y) \cdot \hat{y}(1 - \hat{y}) = (0.55 - 1) \cdot 0.55 \cdot 0.45 = -0.45 \cdot 0.2475 \approx -0.111 \]

The negative sign indicates we should adjust to increase the output.

Step 2: Output layer gradients

\[ \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(2)}} = \delta^{(2)} \cdot (\mathbf{h}^{(1)})^T = -0.111 \cdot [0.5, 0, 0.2] = [-0.056, 0, -0.022] \]

\[ \frac{\partial \mathcal{L}}{\partial b^{(2)}} = -0.111 \]

Step 3: Backpropagate to hidden layer

\[ \frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(1)}} = (\mathbf{W}^{(2)})^T \cdot \delta^{(2)} = \begin{bmatrix} 0.6 \\ -0.4 \\ 0.5 \end{bmatrix} \cdot (-0.111) = \begin{bmatrix} -0.067 \\ 0.044 \\ -0.056 \end{bmatrix} \]

Step 4: ReLU gradient

The ReLU gradient is 1 where \(z^{(1)} > 0\) and 0 otherwise. Since \(\mathbf{z}^{(1)} = [0.5, -0.45, 0.2]^T\), the mask is \([1, 0, 1]^T\).

\[ \delta^{(1)} = \frac{\partial \mathcal{L}}{\partial \mathbf{h}^{(1)}} \odot [1, 0, 1]^T = [-0.067, 0, -0.056]^T \]

The gradient through the second neuron is zero because ReLU killed it during the forward pass.

Step 5: Hidden layer gradients

\[ \frac{\partial \mathcal{L}}{\partial \mathbf{W}^{(1)}} = \delta^{(1)} \cdot \mathbf{x}^T = \begin{bmatrix} -0.067 \\ 0 \\ -0.056 \end{bmatrix} \cdot [1.0, 0.5] = \begin{bmatrix} -0.067 & -0.033 \\ 0 & 0 \\ -0.056 & -0.028 \end{bmatrix} \]

\[ \frac{\partial \mathcal{L}}{\partial \mathbf{b}^{(1)}} = [-0.067, 0, -0.056]^T \]

These gradients tell us how to adjust each parameter to reduce the loss.
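
The whole forward-plus-backward computation fits in a short NumPy sketch. The code below mirrors the five steps above and should reproduce the gradient values to the stated precision (variable names such as delta2 and dW1 are just our conventions for \(\delta^{(2)}\) and \(\partial\mathcal{L}/\partial\mathbf{W}^{(1)}\)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Parameters and input from the earlier forward-pass example.
x  = np.array([1.0, 0.5])
W1 = np.array([[ 0.2,  0.4],
               [-0.5,  0.3],
               [ 0.1, -0.2]])
b1 = np.array([0.1, -0.1, 0.2])
W2 = np.array([[0.6, -0.4, 0.5]])
b2 = np.array([-0.2])
y  = 1.0  # true label

# Forward pass.
z1 = W1 @ x + b1
h1 = np.maximum(0.0, z1)
z2 = W2 @ h1 + b2
y_hat = sigmoid(z2)  # approx 0.55

# Backward pass for L = 0.5 * (y - y_hat)^2 with sigmoid output.
delta2 = (y_hat - y) * y_hat * (1 - y_hat)  # approx -0.111
dW2 = np.outer(delta2, h1)                  # approx [-0.056, 0, -0.022]
db2 = delta2

dh1 = W2.T @ delta2                         # approx [-0.067, 0.044, -0.056]
delta1 = dh1 * (z1 > 0)                     # ReLU mask zeroes neuron 2
dW1 = np.outer(delta1, x)
db1 = delta1

print(dW2, dW1, sep="\n")
```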

5.5 Gradient descent

With gradients in hand, we update parameters to reduce the loss. The simplest approach is gradient descent:

\[ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \nabla_{\boldsymbol{\theta}} \mathcal{L} \]

where \(\eta > 0\) is the learning rate. We move parameters in the direction opposite to the gradient (since the gradient points toward increasing loss).

5.5.1 The learning rate

The learning rate \(\eta\) controls step size. Too large, and we might overshoot the minimum and diverge. Too small, and training takes forever.

Consider minimizing \(f(\theta) = \theta^2\). The minimum is at \(\theta = 0\) with gradient \(\frac{df}{d\theta} = 2\theta\). Starting at \(\theta_0 = 1\):

  • With \(\eta = 0.1\): \(\theta_1 = 1 - 0.1 \cdot 2 = 0.8\), \(\theta_2 = 0.8 - 0.1 \cdot 1.6 = 0.64\), … converges slowly
  • With \(\eta = 0.5\): \(\theta_1 = 1 - 0.5 \cdot 2 = 0\), reaches minimum in one step!
  • With \(\eta = 0.9\): \(\theta_1 = 1 - 0.9 \cdot 2 = -0.8\), \(\theta_2 = -0.8 - 0.9 \cdot (-1.6) = 0.64\), oscillates but converges
  • With \(\eta = 1.1\): \(\theta_1 = 1 - 1.1 \cdot 2 = -1.2\), \(\theta_2 = -1.2 - 1.1 \cdot (-2.4) = 1.44\), diverges!

For this simple quadratic, any \(0 < \eta < 1\) converges, \(\eta = 0.5\) is optimal, and \(\eta > 1\) diverges. Real loss landscapes are more complex, but the intuition holds: there’s a “Goldilocks zone” for learning rates.
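
You can reproduce this little experiment with a few lines of Python (run is our own helper name):

```python
# Gradient descent on f(theta) = theta^2, whose gradient is 2*theta.
def run(eta, steps=5):
    theta = 1.0
    trace = [theta]
    for _ in range(steps):
        theta = theta - eta * 2 * theta
        trace.append(theta)
    return trace

for eta in (0.1, 0.5, 0.9, 1.1):
    print(eta, run(eta))
```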

5.5.2 Stochastic and mini-batch gradient descent

Computing the gradient over all \(m\) training examples is expensive. Stochastic gradient descent (SGD) uses a single random example:

\[ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \nabla_{\boldsymbol{\theta}} \mathcal{L}^{(i)} \]

where \(\mathcal{L}^{(i)}\) is the loss for example \(i\). This is noisy but much faster per update.

Mini-batch gradient descent is a compromise: use a small batch of \(B\) examples:

\[ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \cdot \frac{1}{B} \sum_{i \in \text{batch}} \nabla_{\boldsymbol{\theta}} \mathcal{L}^{(i)} \]

Typical batch sizes are 32, 64, 128, or 256. Mini-batches reduce variance compared to SGD while remaining computationally efficient.
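
A minimal mini-batch SGD loop might look like the sketch below, where grad_loss is a hypothetical function returning the gradient of the per-example loss with respect to the parameters, and X, Y are assumed to be NumPy arrays:

```python
import numpy as np

def minibatch_sgd(theta, X, Y, grad_loss, eta=0.01, batch_size=32, epochs=10):
    m = len(X)
    for _ in range(epochs):
        perm = np.random.permutation(m)  # reshuffle the data each epoch
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]
            # Average the per-example gradients over the batch.
            g = np.mean([grad_loss(theta, X[i], Y[i]) for i in idx], axis=0)
            theta = theta - eta * g
    return theta
```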

5.5.3 Momentum and Adam

Plain SGD can be slow, especially in “ravines” where the gradient points mostly sideways. Momentum accumulates a velocity:

\[ \mathbf{v} \leftarrow \beta \mathbf{v} - \eta \nabla_{\boldsymbol{\theta}} \mathcal{L} \]

\[ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} + \mathbf{v} \]

where \(\beta \approx 0.9\) is the momentum coefficient. This smooths out oscillations and accelerates along consistent gradient directions.
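
In code, a single momentum update is two lines; the helper below is our own sketch, assuming grad holds the current gradient:

```python
def momentum_step(theta, v, grad, eta=0.01, beta=0.9):
    v = beta * v - eta * grad  # accumulate velocity from past gradients
    theta = theta + v          # move along the velocity
    return theta, v
```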

Adam (Adaptive Moment Estimation) goes further by adapting the learning rate for each parameter based on historical gradients. It maintains running averages of the gradient and squared gradient:

\[ \mathbf{m} \leftarrow \beta_1 \mathbf{m} + (1 - \beta_1) \nabla_{\boldsymbol{\theta}} \mathcal{L} \]

\[ \mathbf{v} \leftarrow \beta_2 \mathbf{v} + (1 - \beta_2) (\nabla_{\boldsymbol{\theta}} \mathcal{L})^2 \]

\[ \boldsymbol{\theta} \leftarrow \boldsymbol{\theta} - \eta \frac{\hat{\mathbf{m}}}{\sqrt{\hat{\mathbf{v}}} + \epsilon} \]

where \(\hat{\mathbf{m}}\) and \(\hat{\mathbf{v}}\) are bias-corrected estimates and \(\epsilon \approx 10^{-8}\) prevents division by zero. Adam is the default optimizer for training transformers.
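
A sketch of one Adam update, following the equations above (the helper name and default hyperparameters are our own; t is the step count, starting at 1, used for bias correction):

```python
import numpy as np

def adam_step(theta, grad, m, v, t, eta=1e-3,
              beta1=0.9, beta2=0.999, eps=1e-8):
    m = beta1 * m + (1 - beta1) * grad       # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad**2    # running mean of squared gradients
    m_hat = m / (1 - beta1**t)               # bias-corrected first moment
    v_hat = v / (1 - beta2**t)               # bias-corrected second moment
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v
```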

5.6 Putting it together

Let’s summarize the training loop for a neural network:

  1. Initialize parameters randomly (careful initialization matters; we’ll discuss this later)
  2. Repeat until convergence:
    1. Sample a mini-batch of training examples
    2. Forward pass: Compute predictions and loss
    3. Backward pass: Compute gradients via backpropagation
    4. Update: Adjust parameters using optimizer (SGD, Adam, etc.)
  3. Evaluate on held-out data to check generalization

This loop is the heartbeat of deep learning. Every neural network, from simple MLPs to massive transformers, trains this way. The differences lie in architecture (what functions the network computes) and scale (how many parameters, how much data, how much compute).
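
Here is the loop as schematic Python. The helpers forward_loss, backward, and update are placeholders standing in for the forward pass, backpropagation, and optimizer step developed in this chapter; X and Y are assumed to be NumPy arrays:

```python
import numpy as np

def train(params, X, Y, forward_loss, backward, update,
          batch_size=64, epochs=10):
    m = len(X)
    for epoch in range(epochs):
        perm = np.random.permutation(m)
        for start in range(0, m, batch_size):
            idx = perm[start:start + batch_size]
            loss, cache = forward_loss(params, X[idx], Y[idx])  # forward pass
            grads = backward(params, cache)                     # backpropagation
            params = update(params, grads)                      # SGD, Adam, ...
    return params
```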

In the next chapter, we’ll see why standard feedforward networks struggle with sequential data, motivating the architectures that eventually led to transformers.