2 Calculus foundations
After completing this chapter, you will be able to:
- Compute limits and derivatives from first principles
- Apply the chain rule to composite functions
- Calculate partial derivatives and gradients for multivariable functions
- Derive the backpropagation algorithm using the chain rule
- Perform matrix calculus operations needed for neural network training
2.1 Limits: the idea of approaching
Before we can understand derivatives, we need to understand limits. A limit captures the idea of “approaching” a value, even if we never actually reach it.
Consider the function \(f(x) = \frac{x^2 - 1}{x - 1}\). What happens at \(x = 1\)? If we try to plug in \(x = 1\), we get \(\frac{0}{0}\), which is undefined. But let’s see what happens as \(x\) approaches 1:
| \(x\) | \(f(x) = \frac{x^2 - 1}{x - 1}\) |
|---|---|
| 0.9 | 1.9 |
| 0.99 | 1.99 |
| 0.999 | 1.999 |
| 1.001 | 2.001 |
| 1.01 | 2.01 |
| 1.1 | 2.1 |
As \(x\) gets closer to 1, \(f(x)\) gets closer to 2. We write this as:
\[ \lim_{x \to 1} \frac{x^2 - 1}{x - 1} = 2 \]
The function never equals 2 at \(x = 1\) (it’s undefined there), but it approaches 2 arbitrarily closely. We can verify this algebraically: \(\frac{x^2 - 1}{x - 1} = \frac{(x-1)(x+1)}{x-1} = x + 1\) for \(x \neq 1\), and \(x + 1 \to 2\) as \(x \to 1\).
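The table above is easy to reproduce; here is a minimal Python sketch (the helper name `f` is just for illustration):

```python
# Evaluate f(x) = (x^2 - 1) / (x - 1) at points approaching x = 1.
def f(x):
    return (x**2 - 1) / (x - 1)

for x in [0.9, 0.99, 0.999, 1.001, 1.01, 1.1]:
    print(x, f(x))
# The outputs approach 2 from both sides, matching the limit.
```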
Why do limits matter? Because many important quantities can’t be computed directly but can be defined as limits. The most important example is instantaneous rate of change.
2.2 Functions and derivatives
A function \(f: \mathbb{R} \to \mathbb{R}\) maps real numbers to real numbers. Suppose we want to know how fast \(f\) is changing at a specific point \(x\).
For average rate of change, we pick two points and compute the slope between them:
\[ \text{average rate of change} = \frac{f(x + h) - f(x)}{h} \]
This is the slope of the secant line connecting \((x, f(x))\) and \((x+h, f(x+h))\). But what if we want the instantaneous rate of change at exactly \(x\)? We can’t use \(h = 0\) because that gives \(\frac{0}{0}\). Instead, we take the limit as \(h\) approaches 0:
\[ \frac{df}{dx} = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h} \]
This is the derivative of \(f\) at \(x\). Let’s compute a concrete example. For \(f(x) = x^2\), what is the derivative at \(x = 3\)?
\[ \frac{df}{dx}\bigg|_{x=3} = \lim_{h \to 0} \frac{(3+h)^2 - 3^2}{h} = \lim_{h \to 0} \frac{9 + 6h + h^2 - 9}{h} = \lim_{h \to 0} \frac{6h + h^2}{h} = \lim_{h \to 0} (6 + h) = 6 \]
Let’s verify this makes sense by computing the average rate of change for smaller and smaller \(h\):
| \(h\) | \(\frac{(3+h)^2 - 9}{h}\) |
|---|---|
| 1 | \(\frac{16-9}{1} = 7\) |
| 0.1 | \(\frac{9.61-9}{0.1} = 6.1\) |
| 0.01 | \(\frac{9.0601-9}{0.01} = 6.01\) |
| 0.001 | \(\frac{9.006001-9}{0.001} = 6.001\) |
As \(h\) shrinks, the average rate approaches 6. The derivative captures the instantaneous rate of change: at \(x = 3\), the function \(f(x) = x^2\) is increasing at a rate of 6 units of output per unit of input.
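You can reproduce this table with a few lines of Python; a minimal sketch:

```python
# Difference quotients for f(x) = x^2 at x = 3 with shrinking step sizes h.
def f(x):
    return x**2

x = 3.0
for h in [1.0, 0.1, 0.01, 0.001]:
    print(h, (f(x + h) - f(x)) / h)
# The quotients approach 6, the exact derivative 2x at x = 3.
```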
Geometrically, the derivative is the slope of the tangent line to the curve at that point. As \(h \to 0\), the secant line rotates and becomes the tangent line.
For general \(x\), we can derive that \(\frac{d}{dx}x^2 = 2x\). At \(x = 3\), this gives \(2 \cdot 3 = 6\), confirming our calculation.
Some essential derivatives to know:
\[\begin{aligned} \frac{d}{dx}x^n &= nx^{n-1} \\ \frac{d}{dx}e^x &= e^x \\ \frac{d}{dx}\ln x &= \frac{1}{x} \\ \frac{d}{dx}\sin x &= \cos x \\ \frac{d}{dx}\cos x &= -\sin x \end{aligned}\]
2.3 The chain rule: derivatives of composed functions
The chain rule is the most important rule in calculus for machine learning. Neural networks are compositions of many functions, and the chain rule tells us how to differentiate through all of them. Let’s build deep intuition for why it works.
Suppose we have two functions composed: \(y = f(u)\) where \(u = g(x)\). So \(x\) feeds into \(g\) to produce \(u\), then \(u\) feeds into \(f\) to produce \(y\):
\[ x \xrightarrow{g} u \xrightarrow{f} y \]
We want \(\frac{dy}{dx}\): how does \(y\) change when we change \(x\)? The chain rule says:
\[ \frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} \]
Why does multiplication make sense? Think about it in terms of small changes. Suppose:
- A small change \(\Delta x\) in \(x\) causes a change \(\Delta u\) in \(u\)
- That change \(\Delta u\) in \(u\) causes a change \(\Delta y\) in \(y\)
The ratio \(\frac{\Delta u}{\Delta x}\) tells us how much \(u\) changes per unit change in \(x\). The ratio \(\frac{\Delta y}{\Delta u}\) tells us how much \(y\) changes per unit change in \(u\). To find how much \(y\) changes per unit change in \(x\), we multiply:
\[ \frac{\Delta y}{\Delta x} = \frac{\Delta y}{\Delta u} \cdot \frac{\Delta u}{\Delta x} \]
The \(\Delta u\) terms “cancel” (though this is intuition, not rigorous proof). Taking the limit as all changes become infinitesimally small gives the chain rule.
Concrete example. Let \(y = (3x + 1)^2\). We can write this as \(y = u^2\) where \(u = 3x + 1\). Then:
- \(\frac{du}{dx} = 3\) (the inner function’s derivative)
- \(\frac{dy}{du} = 2u = 2(3x + 1)\) (the outer function’s derivative)
By the chain rule:
\[ \frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dx} = 2(3x + 1) \cdot 3 = 6(3x + 1) \]
Let’s verify with numbers. At \(x = 2\): \(u = 3(2) + 1 = 7\), so \(y = 49\).
- If \(x\) increases by a tiny amount \(\Delta x = 0.001\), then \(u\) becomes \(3(2.001) + 1 = 7.003\)
- The change in \(u\) is \(\Delta u = 0.003\), so \(\frac{\Delta u}{\Delta x} = \frac{0.003}{0.001} = 3\) ✓
- With \(u = 7.003\), \(y\) becomes \((7.003)^2 = 49.042009\)
- The change in \(y\) is \(\Delta y = 0.042009\), so \(\frac{\Delta y}{\Delta u} = \frac{0.042009}{0.003} \approx 14.003 \approx 2u\) ✓
- The total rate: \(\frac{\Delta y}{\Delta x} = \frac{0.042009}{0.001} = 42.009\)
Our formula predicts \(\frac{dy}{dx} = 6(3 \cdot 2 + 1) = 6 \cdot 7 = 42\). The numerical calculation gives 42.009, approaching 42 as \(\Delta x \to 0\). ✓
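The same check in a short Python sketch, using a few arbitrary step sizes:

```python
# Finite-difference check of dy/dx for y = (3x + 1)^2 at x = 2 (chain-rule value: 42).
def y(x):
    return (3 * x + 1) ** 2

x = 2.0
for dx in [0.01, 0.001, 0.0001]:
    print(dx, (y(x + dx) - y(x)) / dx)
print(6 * (3 * x + 1))   # 42.0
```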
The amplification interpretation. Here’s another way to think about the chain rule. Each function in the chain acts as an “amplifier” for small changes:
- The function \(g\) amplifies changes in \(x\) by factor \(\frac{du}{dx}\)
- The function \(f\) amplifies changes in \(u\) by factor \(\frac{dy}{du}\)
- The total amplification is the product: \(\frac{dy}{du} \cdot \frac{du}{dx}\)
If \(g\) doubles small changes (derivative = 2) and \(f\) triples small changes (derivative = 3), then the composition multiplies small changes by \(2 \times 3 = 6\).
Longer chains. The chain rule extends naturally. If \(y = f(u)\), \(u = g(v)\), \(v = h(x)\), then:
\[ \frac{dy}{dx} = \frac{dy}{du} \cdot \frac{du}{dv} \cdot \frac{dv}{dx} \]
Each link in the chain contributes a multiplicative factor. This is exactly how backpropagation works in neural networks: we multiply the local derivatives at each layer to get the total derivative from output to input.
Why the chain rule matters for neural networks. Consider a simple neural network: input \(x\), hidden layer \(h = \sigma(wx + b)\), output \(y = vh\). To train this network, we need \(\frac{dy}{dw}\): how does the output change when we adjust the weight \(w\)?
Using the chain rule:
\[ \frac{dy}{dw} = \frac{dy}{dh} \cdot \frac{dh}{dw} = v \cdot \frac{dh}{dw} \]
And \(\frac{dh}{dw}\) requires another application of the chain rule, since \(h = \sigma(wx + b)\) depends on \(w\) through its argument \(wx + b\): \(\frac{dh}{dw} = \sigma'(wx + b) \cdot x\). This cascading application of the chain rule through all layers is backpropagation.
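As a numerical sanity check, here is a minimal Python sketch of this tiny network, assuming \(\sigma\) is the sigmoid \(\sigma(z) = 1/(1+e^{-z})\) (its derivative \(\sigma(1-\sigma)\) is derived in Section 2.7); the specific values of \(x\), \(w\), \(b\), \(v\) are arbitrary illustration choices:

```python
import math

# Tiny network y = v * sigma(w*x + b) with sigma the sigmoid; compare the chain-rule
# gradient dy/dw = v * sigma'(w*x + b) * x against a finite-difference estimate.
# The values of x, w, b, v below are arbitrary illustration choices.
def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x, w, b, v = 0.5, 1.2, -0.3, 2.0

def forward(w):
    return v * sigmoid(w * x + b)

s = sigmoid(w * x + b)
analytic = v * s * (1.0 - s) * x          # chain rule, using sigma' = sigma(1 - sigma)
eps = 1e-6
numeric = (forward(w + eps) - forward(w - eps)) / (2 * eps)
print(analytic, numeric)                  # the two values agree to several decimal places
```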
2.4 Other differentiation rules
The product rule says: if \(y = f(x) \cdot g(x)\), then:
\[ \frac{dy}{dx} = \frac{df}{dx} \cdot g(x) + f(x) \cdot \frac{dg}{dx} \]
The intuition: when both \(f\) and \(g\) depend on \(x\), changing \(x\) affects \(y\) through both paths. The first term captures the effect of \(f\) changing while \(g\) stays fixed; the second captures \(g\) changing while \(f\) stays fixed.
The quotient rule says: if \(y = \frac{f(x)}{g(x)}\), then:
\[ \frac{dy}{dx} = \frac{\frac{df}{dx} \cdot g(x) - f(x) \cdot \frac{dg}{dx}}{g(x)^2} \]
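Both rules are easy to spot-check numerically; a minimal Python sketch with the illustrative choices \(f(x) = x^2\) and \(g(x) = \sin x\):

```python
import math

# Finite-difference check of the product and quotient rules with f(x) = x^2, g(x) = sin(x).
x, eps = 1.3, 1e-6                   # arbitrary test point
f_val, df = x**2, 2 * x
g_val, dg = math.sin(x), math.cos(x)

product_rule = df * g_val + f_val * dg
quotient_rule = (df * g_val - f_val * dg) / g_val**2

def prod(t):
    return t**2 * math.sin(t)

def quot(t):
    return t**2 / math.sin(t)

print(product_rule, (prod(x + eps) - prod(x - eps)) / (2 * eps))
print(quotient_rule, (quot(x + eps) - quot(x - eps)) / (2 * eps))
```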
2.5 Partial derivatives and gradients
2.5.1 The problem: optimizing functions of many variables
Here’s the situation that motivates everything in this section. You have a neural network with millions of parameters (weights). You have a loss function that measures how wrong the network’s predictions are. The loss depends on all those parameters:
\[ L = L(w_1, w_2, \ldots, w_{1000000}) \]
You want to find parameter values that make the loss small. How do you do it?
You can’t just try all possible combinations. With a million parameters, even if each could take only 10 values, you’d have \(10^{1000000}\) combinations to check. The universe doesn’t have enough atoms for that.
Instead, you need a smarter strategy: start somewhere, figure out which direction is downhill, take a step that way, repeat. This is gradient descent, and it requires answering a key question: given where you are, which direction reduces the loss most quickly?
That question is what partial derivatives and gradients answer.
2.5.2 Thinking geometrically: functions as landscapes
To build intuition, start with a function of just two variables: \(f(x, y)\). We can visualize this as a surface in 3D space, where the height at point \((x, y)\) is \(f(x, y)\).
Imagine standing on a hillside. The ground beneath you is the surface \(z = f(x, y)\). You’re at position \((x_0, y_0)\) at height \(f(x_0, y_0)\). You want to walk downhill as quickly as possible.
Which way should you go?
You could walk due east (increasing \(x\), keeping \(y\) fixed). How steep is that? The slope in the east direction is \(\frac{\partial f}{\partial x}\), the partial derivative with respect to \(x\).
You could walk due north (increasing \(y\), keeping \(x\) fixed). How steep is that? The slope in the north direction is \(\frac{\partial f}{\partial y}\), the partial derivative with respect to \(y\).
But you’re not limited to walking along coordinate axes. You could walk northeast, or any direction. The gradient \(\nabla f\) tells you the direction of steepest ascent. To go downhill fastest, walk in the direction \(-\nabla f\).
2.5.3 Partial derivatives: slopes along coordinate axes
The partial derivative \(\frac{\partial f}{\partial x}\) answers: “If I move only in the \(x\) direction, how fast does \(f\) change?”
Mechanically, you compute it by treating all other variables as constants and differentiating with respect to \(x\). But the meaning is geometric: it’s the slope of the surface in the \(x\) direction.
Consider \(f(x, y) = x^2 + y^2\). This is a paraboloid, a bowl opening upward with its bottom at the origin.
\[ \frac{\partial f}{\partial x} = 2x, \quad \frac{\partial f}{\partial y} = 2y \]
At the point \((3, 4)\):
- \(\frac{\partial f}{\partial x} = 6\): walking east, you’re climbing at slope 6
- \(\frac{\partial f}{\partial y} = 8\): walking north, you’re climbing at slope 8
At the origin \((0, 0)\):
- \(\frac{\partial f}{\partial x} = 0\), \(\frac{\partial f}{\partial y} = 0\): the surface is flat in both directions
This makes sense. The paraboloid has its minimum at the origin, where the surface is horizontal. As you move away from the origin, the slopes increase.
The formal definition is:
\[ \frac{\partial f}{\partial x} = \lim_{h \to 0} \frac{f(x + h, y) - f(x, y)}{h} \]
Notice that \(y\) doesn’t change. We’re measuring the slope along a slice where \(y\) is held constant.
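This definition translates directly into code; a minimal Python sketch at the point \((3, 4)\) from the example above:

```python
# Forward-difference approximation of the partials of f(x, y) = x^2 + y^2 at (3, 4),
# mirroring the definition: perturb one variable, hold the other constant.
def f(x, y):
    return x**2 + y**2

x0, y0, h = 3.0, 4.0, 1e-6
df_dx = (f(x0 + h, y0) - f(x0, y0)) / h   # y held constant
df_dy = (f(x0, y0 + h) - f(x0, y0)) / h   # x held constant
print(df_dx, df_dy)   # about 6 and about 8, matching 2x and 2y
```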
2.5.4 The gradient: the compass pointing uphill
The gradient combines all partial derivatives into a vector:
\[ \nabla f = \begin{bmatrix} \frac{\partial f}{\partial x} \\ \frac{\partial f}{\partial y} \end{bmatrix} \]
For \(f(x, y) = x^2 + y^2\):
\[ \nabla f = \begin{bmatrix} 2x \\ 2y \end{bmatrix} \]
At the point \((3, 4)\): \(\nabla f = \begin{bmatrix} 6 \\ 8 \end{bmatrix}\).
This vector has a profound meaning: it points in the direction (in the x-y plane) of steepest ascent. The gradient lives in the input space, not in 3D. It tells you which way to walk horizontally to climb most steeply.
Why? At first this might seem wrong. Going north gives slope 8. Going east gives slope 6. Isn’t north the best direction? Why would combining them help?
The key insight: when you walk diagonally, you’re not diluting the good direction with the bad one. You’re harvesting height gain from both directions simultaneously.
Let’s compute the height gain, to first order, for a unit step in each direction:
- North \([0, 1]\): Move 0 in \(x\), 1 in \(y\). Height gain \(= 0 \times 6 + 1 \times 8 = 8\)
- East \([1, 0]\): Move 1 in \(x\), 0 in \(y\). Height gain \(= 1 \times 6 + 0 \times 8 = 6\)
- Gradient direction \([0.6, 0.8]\): Move 0.6 in \(x\), 0.8 in \(y\). Height gain \(= 0.6 \times 6 + 0.8 \times 8 = 3.6 + 6.4 = 10\)
The diagonal step gets 3.6 from the \(x\)-component AND 6.4 from the \(y\)-component. These contributions add up to more than either pure direction alone.
But why is \([0.6, 0.8]\) the best proportion? Why not \([0.5, 0.5]\) or \([0.1, 0.9]\)?
We want to maximize height gain \(= 6 u_1 + 8 u_2\), subject to taking a unit step: \(u_1^2 + u_2^2 = 1\).
The constraint is crucial. We have a “budget” of one unit of movement to allocate between \(x\) and \(y\). But the budget is circular (Euclidean), not linear. This matters.
If the budget were linear (\(u_1 + u_2 = 1\)), we’d put everything into \(y\) since it pays better. But with a circular budget, we can do something clever.
Consider what happens as we tilt from north toward the gradient direction:
- \([0, 1]\): Height gain \(= 0 \times 6 + 1 \times 8 = 8\)
- \([0.6, 0.8]\): Height gain \(= 0.6 \times 6 + 0.8 \times 8 = 10\)
By giving up only 0.2 units of \(y\)-movement (from 1 to 0.8), we gain 0.6 units of \(x\)-movement. The circular constraint means small sacrifices in one direction buy disproportionately large gains in the other.
Why does \([0.6, 0.8]\) hit the sweet spot? Because it’s proportional to the payoffs \([6, 8]\). The direction that maximizes \(6u_1 + 8u_2\) on the unit circle points the same way as \([6, 8]\) itself. Normalizing: \([6, 8]/\sqrt{36+64} = [6, 8]/10 = [0.6, 0.8]\).
The gradient automatically encodes the right trade-off. Larger partial derivative means that direction is more valuable, so we tilt toward it, in exact proportion to how much more valuable it is.
What we just computed has a name: the directional derivative. The slope in direction \(\mathbf{u}\) is:
\[ D_\mathbf{u} f = \nabla f \cdot \mathbf{u} = \frac{\partial f}{\partial x} u_1 + \frac{\partial f}{\partial y} u_2 \]
This formula captures exactly what we did: multiply each partial derivative by how much we move in that direction, then add up the contributions. The dot product is just a compact way to write “weight each slope by the corresponding component of \(\mathbf{u}\), then sum.”
Since the directional derivative is a dot product, we can use the geometric formula: \(\nabla f \cdot \mathbf{u} = \|\nabla f\| \|\mathbf{u}\| \cos\theta = \|\nabla f\| \cos\theta\) for a unit vector \(\mathbf{u}\), where \(\theta\) is the angle between the two vectors. This is maximized when \(\theta = 0\) (vectors aligned), giving maximum slope \(\|\nabla f\|\). For our example: \(\sqrt{6^2 + 8^2} = 10\), exactly what we found.
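To see that no other direction beats the gradient, you can sweep the unit circle numerically; a minimal Python sketch:

```python
import math

# Slope of f(x, y) = x^2 + y^2 at (3, 4) in unit directions u = (cos t, sin t).
# The largest slope should occur in the gradient direction [0.6, 0.8] and equal 10.
gx, gy = 6.0, 8.0
best_slope, best_dir = float("-inf"), None
for k in range(3600):
    t = 2 * math.pi * k / 3600
    u = (math.cos(t), math.sin(t))
    slope = gx * u[0] + gy * u[1]    # directional derivative, grad . u
    if slope > best_slope:
        best_slope, best_dir = slope, u
print(best_slope, best_dir)          # about 10.0 in a direction about (0.6, 0.8)
```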
One more observation: the gradient \([6, 8]\) points directly away from the origin. On our bowl-shaped paraboloid, the steepest way up is radially outward. The gradient captures this geometric fact automatically.
For gradient descent, we go the opposite way: \(-\nabla f = [-6, -8]\) points toward the origin, toward the minimum. This is why gradient descent works.
2.5.5 The chain rule: tracking influence through layers
Neural networks are compositions of functions. The input passes through layer 1, then layer 2, then layer 3, and so on. The loss depends on the output, which depends on the intermediate layers, which depend on the parameters.
How does a small change in a parameter \(w\) in layer 1 affect the final loss?
The change propagates forward through all subsequent layers. We need to track this chain of influence.
Single path. If \(x\) affects \(y\), and \(y\) affects \(z\), and these are the only connections:
\[ x \to y \to z \]
Then \(\frac{dz}{dx} = \frac{dz}{dy} \cdot \frac{dy}{dx}\). We multiply the rates of change along the path.
Multiple paths. But often \(x\) can influence \(z\) through multiple intermediate variables:
\[ x \to \begin{cases} y_1 \\ y_2 \end{cases} \to z \]
Here \(x\) affects \(y_1\) and \(y_2\), and both affect \(z\). A small change \(\Delta x\) causes:
- Change in \(y_1\): \(\Delta y_1 \approx \frac{dy_1}{dx} \Delta x\)
- Change in \(y_2\): \(\Delta y_2 \approx \frac{dy_2}{dx} \Delta x\)
These changes independently affect \(z\):
- Effect through \(y_1\): \(\frac{\partial z}{\partial y_1} \Delta y_1\)
- Effect through \(y_2\): \(\frac{\partial z}{\partial y_2} \Delta y_2\)
Total effect: sum them up.
\[ \frac{dz}{dx} = \frac{\partial z}{\partial y_1}\frac{dy_1}{dx} + \frac{\partial z}{\partial y_2}\frac{dy_2}{dx} \]
The general rule: multiply along each path (chain rule), then sum over all paths.
Why sum? Because the effects through different paths are independent and additive. Changing \(x\) slightly perturbs both \(y_1\) and \(y_2\), and \(z\) feels both perturbations. There’s no interaction between the paths at the linear (first-order) level.
Why multiply along a path? Because each link in the chain amplifies or attenuates the change. If \(y\) doubles when \(x\) increases by 1, and \(z\) triples when \(y\) increases by 1, then \(z\) increases by \(2 \times 3 = 6\) when \(x\) increases by 1.
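Here is a minimal numerical sketch of the multiply-then-sum rule, using the illustrative choices \(y_1 = x^2\), \(y_2 = 3x\), \(z = y_1 y_2\):

```python
# Multi-path chain rule: x -> (y1, y2) -> z with y1 = x^2, y2 = 3x, z = y1 * y2.
x = 2.0
y1, y2 = x**2, 3 * x
dz_dy1, dz_dy2 = y2, y1              # partials of z = y1 * y2
dy1_dx, dy2_dx = 2 * x, 3.0
path_sum = dz_dy1 * dy1_dx + dz_dy2 * dy2_dx   # multiply along each path, sum over paths

# Direct check: substituting gives z = 3x^3, so dz/dx = 9x^2.
print(path_sum, 9 * x**2)            # both 36.0
```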
2.5.6 The Jacobian: all derivatives at once
When we have vector inputs \(\mathbf{x} \in \mathbb{R}^n\) and vector outputs \(\mathbf{y} \in \mathbb{R}^m\), we need \(m \times n\) partial derivatives: how does each output depend on each input?
The Jacobian matrix organizes all of these:
\[ \mathbf{J} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \cdots & \frac{\partial y_1}{\partial x_n} \\ \vdots & \ddots & \vdots \\ \frac{\partial y_m}{\partial x_1} & \cdots & \frac{\partial y_m}{\partial x_n} \end{bmatrix} \]
Row \(j\) contains the gradient of \(y_j\). Column \(i\) tells how all outputs respond to input \(x_i\).
Concrete example. Consider a function \(\mathbf{y} = f(\mathbf{x})\) where:
\[ y_1 = x_1^2 + x_2, \quad y_2 = x_1 x_2 \]
The Jacobian is:
\[ \mathbf{J} = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 2x_1 & 1 \\ x_2 & x_1 \end{bmatrix} \]
At the point \((x_1, x_2) = (3, 2)\), the outputs are \(y_1 = 9 + 2 = 11\) and \(y_2 = 6\), and the Jacobian is:
\[ \mathbf{J} = \begin{bmatrix} 6 & 1 \\ 2 & 3 \end{bmatrix} \]
What does this matrix tell us? It predicts how small input changes affect outputs:
\[ \Delta \mathbf{y} \approx \mathbf{J} \Delta \mathbf{x} \]
Suppose we nudge the input by \(\Delta \mathbf{x} = [0.1, 0.2]^T\). The Jacobian predicts:
\[ \Delta \mathbf{y} \approx \begin{bmatrix} 6 & 1 \\ 2 & 3 \end{bmatrix} \begin{bmatrix} 0.1 \\ 0.2 \end{bmatrix} = \begin{bmatrix} 6(0.1) + 1(0.2) \\ 2(0.1) + 3(0.2) \end{bmatrix} = \begin{bmatrix} 0.8 \\ 0.8 \end{bmatrix} \]
Let’s verify. At \((3.1, 2.2)\):
- \(y_1 = (3.1)^2 + 2.2 = 9.61 + 2.2 = 11.81\). Change: \(11.81 - 11 = 0.81\) ✓
- \(y_2 = (3.1)(2.2) = 6.82\). Change: \(6.82 - 6 = 0.82\) ✓
The predictions (0.8, 0.8) are close to the actual changes (0.81, 0.82). The small discrepancy is because the Jacobian gives a linear approximation, which is only exact for infinitesimal changes.
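A minimal NumPy sketch of the same check:

```python
import numpy as np

# Jacobian of y1 = x1^2 + x2, y2 = x1 * x2 at (3, 2), used to predict small output changes.
def f(x):
    return np.array([x[0]**2 + x[1], x[0] * x[1]])

x = np.array([3.0, 2.0])
J = np.array([[2 * x[0], 1.0],
              [x[1],     x[0]]])     # analytic Jacobian from above

dx = np.array([0.1, 0.2])
print(J @ dx)                        # predicted change: [0.8, 0.8]
print(f(x + dx) - f(x))              # actual change:    [0.81, 0.82]
```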
The Jacobian is the multivariate generalization of the derivative. Just as \(\frac{df}{dx}\) tells us how a scalar function responds to a scalar input, the Jacobian tells us how a vector function responds to a vector input.
2.5.7 Backpropagation: the chain rule in matrix form
Consider a neural network as a chain:
\[ \mathbf{x} \xrightarrow{\text{layer 1}} \mathbf{h}_1 \xrightarrow{\text{layer 2}} \mathbf{h}_2 \xrightarrow{\cdots} \mathbf{h}_L \xrightarrow{\text{loss}} L \]
We want \(\nabla_\mathbf{x} L\): how does the loss depend on the input (or on parameters at each layer)?
Working backward from the loss:
- \(\nabla_{\mathbf{h}_L} L\) is the gradient at the last hidden layer
- \(\nabla_{\mathbf{h}_{L-1}} L = \mathbf{J}_L^T \nabla_{\mathbf{h}_L} L\), where \(\mathbf{J}_L\) is the Jacobian of layer \(L\) (the map from \(\mathbf{h}_{L-1}\) to \(\mathbf{h}_L\))
- \(\nabla_{\mathbf{h}_{L-2}} L = \mathbf{J}_{L-1}^T \nabla_{\mathbf{h}_{L-1}} L\)
- … and so on back to the input
Concrete example. Let’s trace through a tiny two-layer network:
\[ \mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix} \xrightarrow{\text{layer 1}} \mathbf{h} = \begin{bmatrix} h_1 \\ h_2 \end{bmatrix} \xrightarrow{\text{layer 2}} L \]
where \(h_1 = x_1 + x_2\), \(h_2 = x_1 - x_2\), and \(L = h_1^2 + h_2^2\).
At \(\mathbf{x} = [3, 1]^T\): \(\mathbf{h} = [4, 2]^T\) and \(L = 16 + 4 = 20\).
Step 1: Gradient at output. How does \(L\) depend on \(\mathbf{h}\)?
\[ \nabla_\mathbf{h} L = \begin{bmatrix} \frac{\partial L}{\partial h_1} \\ \frac{\partial L}{\partial h_2} \end{bmatrix} = \begin{bmatrix} 2h_1 \\ 2h_2 \end{bmatrix} = \begin{bmatrix} 8 \\ 4 \end{bmatrix} \]
Step 2: Jacobian of layer 1. How does \(\mathbf{h}\) depend on \(\mathbf{x}\)?
\[ \mathbf{J} = \begin{bmatrix} \frac{\partial h_1}{\partial x_1} & \frac{\partial h_1}{\partial x_2} \\ \frac{\partial h_2}{\partial x_1} & \frac{\partial h_2}{\partial x_2} \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \]
Step 3: Backpropagate. Multiply by \(\mathbf{J}^T\) (here \(\mathbf{J}\) happens to be symmetric, so \(\mathbf{J}^T = \mathbf{J}\)):
\[ \nabla_\mathbf{x} L = \mathbf{J}^T \nabla_\mathbf{h} L = \begin{bmatrix} 1 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} 8 \\ 4 \end{bmatrix} = \begin{bmatrix} 8 + 4 \\ 8 - 4 \end{bmatrix} = \begin{bmatrix} 12 \\ 4 \end{bmatrix} \]
Let’s verify by direct calculation. Substituting: \(L = (x_1 + x_2)^2 + (x_1 - x_2)^2 = 2x_1^2 + 2x_2^2\).
\[ \nabla_\mathbf{x} L = \begin{bmatrix} 4x_1 \\ 4x_2 \end{bmatrix} = \begin{bmatrix} 12 \\ 4 \end{bmatrix} \quad \checkmark \]
Why the transpose? Look at row 1 of the calculation: \((\nabla_\mathbf{x} L)_1 = 1 \cdot 8 + 1 \cdot 4 = 12\). This is:
\[ \frac{\partial L}{\partial x_1} = \frac{\partial L}{\partial h_1}\frac{\partial h_1}{\partial x_1} + \frac{\partial L}{\partial h_2}\frac{\partial h_2}{\partial x_1} \]
We’re summing over the intermediate variables \(h_1, h_2\). This sum is a dot product of \(\nabla_\mathbf{h} L\) with column 1 of \(\mathbf{J}\), which equals row 1 of \(\mathbf{J}^T\) times \(\nabla_\mathbf{h} L\). That’s why we use the transpose.
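The whole calculation condenses into a few lines of NumPy; a minimal sketch:

```python
import numpy as np

# The two-layer example: h = W x with rows (1, 1) and (1, -1), and L = h1^2 + h2^2.
x = np.array([3.0, 1.0])
W = np.array([[1.0,  1.0],
              [1.0, -1.0]])        # layer-1 Jacobian (the map h = W x is linear)
h = W @ x                          # [4, 2]
L = np.sum(h**2)                   # 20

grad_h = 2.0 * h                   # step 1: dL/dh = [8, 4]
grad_x = W.T @ grad_h              # steps 2-3: backpropagate through J^T
print(grad_x)                      # [12.  4.]
print(4.0 * x)                     # direct formula, since L = 2 x1^2 + 2 x2^2
```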
The gradient flows backward through the network, getting transformed by each layer’s Jacobian transpose. This is backpropagation.
The key insight: we multiply Jacobians across layers (because effects compound through the chain), and we sum within each Jacobian-vector product (because each input affects multiple intermediates, and we must account for all paths).
This interplay of multiplication and summation, flowing backward through the network, is how neural networks learn.
2.6 Matrix calculus
When we differentiate with respect to vectors and matrices, we need to be careful about dimensions. If \(f: \mathbb{R}^n \to \mathbb{R}\) and \(\mathbf{x} \in \mathbb{R}^n\), then \(\nabla_\mathbf{x} f\) is an \(n\)-dimensional column vector. If \(\mathbf{f}: \mathbb{R}^n \to \mathbb{R}^m\) and \(\mathbf{x} \in \mathbb{R}^n\), then the Jacobian \(\mathbf{J} = \nabla_\mathbf{x} \mathbf{f}\) is \(m \times n\).
Here are some useful formulas, each with a concrete example. Let \(\mathbf{x} \in \mathbb{R}^n\), \(\mathbf{a} \in \mathbb{R}^n\), \(\mathbf{A} \in \mathbb{R}^{n \times n}\).
Formula 1: \(\nabla_\mathbf{x} (\mathbf{a}^T\mathbf{x}) = \mathbf{a}\)
This says: the gradient of a linear function is constant. The function \(\mathbf{a}^T\mathbf{x} = a_1 x_1 + a_2 x_2 + \cdots\) increases at rate \(a_i\) per unit increase in \(x_i\), regardless of where you are.
Example: Let \(\mathbf{a} = [3, 2]^T\) and \(f(\mathbf{x}) = \mathbf{a}^T\mathbf{x} = 3x_1 + 2x_2\).
\[ \nabla f = \begin{bmatrix} \frac{\partial}{\partial x_1}(3x_1 + 2x_2) \\ \frac{\partial}{\partial x_2}(3x_1 + 2x_2) \end{bmatrix} = \begin{bmatrix} 3 \\ 2 \end{bmatrix} = \mathbf{a} \quad \checkmark \]
Formula 2: \(\nabla_\mathbf{x} (\mathbf{x}^T\mathbf{x}) = 2\mathbf{x}\)
This is the gradient of the squared norm. The function \(\mathbf{x}^T\mathbf{x} = x_1^2 + x_2^2 + \cdots\) is a paraboloid centered at the origin.
Example: Let \(\mathbf{x} = [3, 4]^T\). Then \(f = \mathbf{x}^T\mathbf{x} = 9 + 16 = 25\).
\[ \nabla f = 2\mathbf{x} = \begin{bmatrix} 6 \\ 8 \end{bmatrix} \]
This is exactly the gradient we computed earlier for \(f(x,y) = x^2 + y^2\) at \((3, 4)\)!
Formula 3: \(\nabla_\mathbf{x} (\mathbf{x}^T\mathbf{A}\mathbf{x}) = (\mathbf{A} + \mathbf{A}^T)\mathbf{x}\)
This is the gradient of a quadratic form. If \(\mathbf{A}\) is symmetric (\(\mathbf{A} = \mathbf{A}^T\)), this simplifies to \(2\mathbf{A}\mathbf{x}\).
Example: Let \(\mathbf{A} = \begin{bmatrix} 1 & 2 \\ 0 & 3 \end{bmatrix}\) and \(\mathbf{x} = [1, 1]^T\).
First, compute \(f = \mathbf{x}^T\mathbf{A}\mathbf{x}\):
\[ \mathbf{A}\mathbf{x} = \begin{bmatrix} 1 & 2 \\ 0 & 3 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 3 \\ 3 \end{bmatrix}, \quad f = \mathbf{x}^T(\mathbf{A}\mathbf{x}) = [1, 1]\begin{bmatrix} 3 \\ 3 \end{bmatrix} = 6 \]
Now the gradient. We have \(\mathbf{A} + \mathbf{A}^T = \begin{bmatrix} 1 & 2 \\ 0 & 3 \end{bmatrix} + \begin{bmatrix} 1 & 0 \\ 2 & 3 \end{bmatrix} = \begin{bmatrix} 2 & 2 \\ 2 & 6 \end{bmatrix}\).
\[ \nabla f = (\mathbf{A} + \mathbf{A}^T)\mathbf{x} = \begin{bmatrix} 2 & 2 \\ 2 & 6 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 4 \\ 8 \end{bmatrix} \]
Let’s verify by expanding \(f\) and differentiating directly. \(f = \mathbf{x}^T\mathbf{A}\mathbf{x} = x_1(x_1 + 2x_2) + x_2(3x_2) = x_1^2 + 2x_1 x_2 + 3x_2^2\).
\[ \frac{\partial f}{\partial x_1} = 2x_1 + 2x_2 = 4, \quad \frac{\partial f}{\partial x_2} = 2x_1 + 6x_2 = 8 \quad \checkmark \]
Formula 4: \(\nabla_\mathbf{x} \|\mathbf{x}\|_2 = \frac{\mathbf{x}}{\|\mathbf{x}\|_2}\)
This is the gradient of the norm itself (not squared). It points radially outward with unit length.
Example: Let \(\mathbf{x} = [3, 4]^T\). Then \(\|\mathbf{x}\|_2 = 5\).
\[ \nabla \|\mathbf{x}\|_2 = \frac{1}{5}\begin{bmatrix} 3 \\ 4 \end{bmatrix} = \begin{bmatrix} 0.6 \\ 0.8 \end{bmatrix} \]
This is a unit vector pointing in the direction of \(\mathbf{x}\). The norm increases at rate 1 per unit step directly away from the origin.
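All four formulas can be spot-checked with finite differences; a minimal NumPy sketch (the test points and the helper name `num_grad` are arbitrary illustration choices):

```python
import numpy as np

# Central-difference spot-check of the four gradient formulas at small test points.
def num_grad(f, x, eps=1e-6):
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = eps
        g[i] = (f(x + e) - f(x - e)) / (2 * eps)
    return g

a = np.array([3.0, 2.0])
A = np.array([[1.0, 2.0],
              [0.0, 3.0]])
x = np.array([1.0, 1.0])

print(num_grad(lambda v: a @ v, x), a)                    # formula 1: a
print(num_grad(lambda v: v @ v, x), 2 * x)                # formula 2: 2x
print(num_grad(lambda v: v @ A @ v, x), (A + A.T) @ x)    # formula 3: (A + A^T) x
x2 = np.array([3.0, 4.0])
print(num_grad(np.linalg.norm, x2), x2 / np.linalg.norm(x2))  # formula 4: x / ||x||
```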
2.7 Important activation functions
Several nonlinear functions appear repeatedly in neural networks. The sigmoid function is:
\[ \sigma(x) = \frac{1}{1 + e^{-x}} \]
It squashes inputs to the range \((0, 1)\). The derivative is:
\[ \frac{d\sigma}{dx} = \sigma(x)(1 - \sigma(x)) \]
To prove this, write \(\sigma(x) = (1 + e^{-x})^{-1}\) and use the chain rule:
\[ \frac{d\sigma}{dx} = -(1 + e^{-x})^{-2} \cdot (-e^{-x}) = \frac{e^{-x}}{(1 + e^{-x})^2} = \frac{1}{1 + e^{-x}} \cdot \frac{e^{-x}}{1 + e^{-x}} = \sigma(x)(1 - \sigma(x)) \]
The hyperbolic tangent is:
\[ \tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} \]
It squashes inputs to \((-1, 1)\). The derivative is:
\[ \frac{d}{dx}\tanh(x) = 1 - \tanh^2(x) \]
The ReLU (rectified linear unit) is:
\[ \text{ReLU}(x) = \max(0, x) = \begin{cases} x & \text{if } x > 0 \\ 0 & \text{if } x \leq 0 \end{cases} \]
The derivative is:
\[ \frac{d}{dx}\text{ReLU}(x) = \begin{cases} 1 & \text{if } x > 0 \\ 0 & \text{if } x < 0 \\ \text{undefined} & \text{if } x = 0 \end{cases} \]
In practice we define it to be 0 or 1 at \(x = 0\). ReLU is simple but effective, and it doesn’t saturate for positive values.
The GELU (Gaussian error linear unit) is:
\[ \text{GELU}(x) = x \cdot \Phi(x) \]
where \(\Phi(x) = \frac{1}{2}[1 + \text{erf}(x/\sqrt{2})]\) is the cumulative distribution function of the standard normal. GELU is commonly used in transformers like GPT. It’s smoother than ReLU and has nice probabilistic interpretations.
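For reference, here is a minimal Python sketch of these activations and their derivatives, with a finite-difference check of the sigmoid and tanh derivative formulas at an arbitrary point:

```python
import math

# Sigmoid, tanh, ReLU, and GELU with their derivatives, plus a finite-difference
# check of the sigmoid and tanh derivative formulas.
def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def d_sigmoid(x):
    s = sigmoid(x)
    return s * (1.0 - s)

def d_tanh(x):
    return 1.0 - math.tanh(x) ** 2

def relu(x):
    return max(0.0, x)

def d_relu(x):
    return 1.0 if x > 0 else 0.0   # taking the derivative to be 0 at x = 0

def gelu(x):
    return x * 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

x, eps = 0.7, 1e-6
print(d_sigmoid(x), (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps))
print(d_tanh(x), (math.tanh(x + eps) - math.tanh(x - eps)) / (2 * eps))
print(relu(-1.0), relu(2.0), d_relu(2.0), gelu(0.7))
```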
2.8 The softmax function
The softmax function is crucial for transformers. It maps a vector \(\mathbf{z} \in \mathbb{R}^n\) to a probability distribution:
\[ \text{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^n e^{z_j}} \]
The outputs are positive and sum to 1, so they can be interpreted as probabilities. Softmax is a smooth, differentiable approximation to the argmax function. When one component of \(\mathbf{z}\) is much larger than the others, softmax puts almost all probability mass there.
The Jacobian of softmax has a special structure. Let \(\mathbf{p} = \text{softmax}(\mathbf{z})\). Then:
\[ \frac{\partial p_i}{\partial z_j} = \begin{cases} p_i(1 - p_i) & \text{if } i = j \\ -p_i p_j & \text{if } i \neq j \end{cases} \]
In matrix form:
\[ \mathbf{J} = \text{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^T \]
Let’s derive this. Start with:
\[ p_i = \frac{e^{z_i}}{\sum_k e^{z_k}} \]
For \(i = j\):
\[ \frac{\partial p_i}{\partial z_i} = \frac{e^{z_i} \sum_k e^{z_k} - e^{z_i} \cdot e^{z_i}}{(\sum_k e^{z_k})^2} = \frac{e^{z_i}}{\sum_k e^{z_k}} \cdot \frac{\sum_k e^{z_k} - e^{z_i}}{\sum_k e^{z_k}} = p_i(1 - p_i) \]
For \(i \neq j\):
\[ \frac{\partial p_i}{\partial z_j} = \frac{0 \cdot \sum_k e^{z_k} - e^{z_i} \cdot e^{z_j}}{(\sum_k e^{z_k})^2} = -\frac{e^{z_i}}{\sum_k e^{z_k}} \cdot \frac{e^{z_j}}{\sum_k e^{z_k}} = -p_i p_j \]
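A minimal NumPy sketch that builds the Jacobian \(\text{diag}(\mathbf{p}) - \mathbf{p}\mathbf{p}^T\) and checks it column by column against finite differences (the test vector \(\mathbf{z}\) is arbitrary):

```python
import numpy as np

# Softmax and its Jacobian diag(p) - p p^T, verified with central differences.
def softmax(z):
    e = np.exp(z - z.max())          # subtracting the max does not change the result
    return e / e.sum()

z = np.array([1.0, 2.0, 0.5])        # arbitrary logits
p = softmax(z)
J = np.diag(p) - np.outer(p, p)      # analytic Jacobian

eps = 1e-6
J_num = np.zeros((3, 3))
for j in range(3):
    e_j = np.zeros(3)
    e_j[j] = eps
    J_num[:, j] = (softmax(z + e_j) - softmax(z - e_j)) / (2 * eps)
print(np.allclose(J, J_num, atol=1e-8))   # True
```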
In attention mechanisms, we apply softmax to similarity scores to get attention weights. Understanding how gradients flow through softmax is essential for understanding how attention is learned.