3  Probability basics

Learning objectives

After completing this chapter, you will be able to:

  • Distinguish between discrete and continuous random variables
  • Calculate expectations, variances, and standard deviations
  • Apply Bayes’ theorem to update beliefs given evidence
  • Compute KL divergence and cross-entropy between distributions
  • Understand why cross-entropy is the natural loss function for classification

Probability is the language of uncertainty. In machine learning, we use it everywhere: our model’s output is a probability distribution over possible next tokens, our training objective measures how well predicted probabilities match observed data, and our understanding of why networks work at all relies on probabilistic reasoning.

This chapter develops probability from the ground up, focusing on the concepts that matter most for understanding transformers.

3.1 A note on notation

In linear algebra, we used bold uppercase for matrices (\(\mathbf{W}\), \(\mathbf{X}\)) and bold lowercase for vectors (\(\mathbf{v}\), \(\mathbf{x}\)).

Probability introduces a different convention: random variables are written in italic uppercase without bold: \(X\), \(Y\), \(Z\). This is standard notation in probability theory.

Symbol Meaning Example
\(\mathbf{X}\) Matrix (bold) A \(3 \times 3\) array of weights
\(\mathbf{x}\) Vector (bold lowercase) A point in 3D space: \([1, 2, 3]^T\)
\(X\) Random variable (italic, not bold) The outcome of rolling a die
\(x\) A specific value (lowercase) The number 4

A random variable isn’t a number—it’s a placeholder for an outcome that hasn’t been determined yet. Think of it as the question rather than the answer.

Concrete example: You’re about to roll a die.

  • Before rolling: \(X\) represents the uncertain outcome. We don’t know what it will be yet. \(X\) could be 1, 2, 3, 4, 5, or 6.
  • After rolling: You got a 4. Now \(x = 4\) is the specific value that occurred.

The notation \(P(X = 4)\) asks: “Before rolling, what’s the probability that \(X\) will equal 4?” For a fair die, \(P(X = 4) = \frac{1}{6}\).

We write \(p(x)\) as shorthand for \(P(X = x)\)—the probability that the random variable \(X\) takes value \(x\). So for our die:

  • \(p(1) = P(X = 1) = \frac{1}{6}\)
  • \(p(2) = P(X = 2) = \frac{1}{6}\)
  • \(p(6) = P(X = 6) = \frac{1}{6}\)

Another example: Let \(Y\) be “the next word a language model predicts.”

  • \(Y\) is the random variable—the uncertain outcome
  • \(y\) = “cat” is one possible value
  • \(P(Y = \text{"cat"}) = 0.15\) means the model assigns 15% probability to “cat”
  • \(p(\text{"cat"}) = 0.15\) is the same thing, shorter notation

3.2 Why probability for neural networks?

Consider what a language model does. Given the input “The cat sat on the”, what should the model output? Not a single word—that would be too confident. The model should express uncertainty: “mat” is likely, “floor” is possible, “elephant” is unlikely but not impossible.

The model outputs a probability distribution over the vocabulary:

Word Probability
mat 0.35
floor 0.20
rug 0.15
bed 0.10
elephant 0.0001

These probabilities must be non-negative and sum to 1. They express the model’s beliefs about what comes next.

Training the model means adjusting its parameters so that its predicted probabilities align with reality. If the true next word was “mat,” we want the model to assign high probability to “mat.” This is where the loss function comes in—but to understand loss functions, we need to understand probability properly.

3.3 Probability as a measure of belief

What does “\(P(\text{mat}) = 0.35\)” mean? There are two interpretations:

Frequentist: If we saw “The cat sat on the” many times, about 35% of the time the next word would be “mat.”

Bayesian: The model assigns a 35% degree of belief to “mat” being the next word.

For neural networks, the Bayesian interpretation is more useful. The model doesn’t have access to infinite repetitions—it has seen some training data and formed beliefs based on it.

The key constraint on probabilities is that they must form a valid distribution:

  1. Every probability is non-negative: \(P(x) \geq 0\) for all \(x\)
  2. Probabilities sum to 1: \(\sum_x P(x) = 1\)

These aren’t arbitrary rules—they ensure probabilities behave sensibly. If we’re certain one thing will happen, we assign it probability 1 and everything else probability 0. If we’re completely uncertain among \(n\) options, we assign each probability \(1/n\).

3.4 Discrete vs continuous

A discrete random variable takes countably many values (like words in a vocabulary). A continuous random variable takes values in a continuum (like the real numbers).

For discrete variables, we have a probability mass function (PMF):

\[ p(x) = P(X = x) \]

Each value \(x\) has a specific probability.

Example: A fair die has \(p(1) = p(2) = \cdots = p(6) = \frac{1}{6}\).

For continuous variables, we can’t assign probability to individual points (there are infinitely many). Instead, we have a probability density function (PDF) \(f(x)\), and probability comes from integration:

\[ P(a \leq X \leq b) = \int_a^b f(x) \, dx \]

The density \(f(x)\) tells us how “concentrated” probability is near \(x\). High density means outcomes near \(x\) are more likely; low density means they’re less likely.
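To make the integral concrete, here is a minimal sketch in plain Python; the density \(f(x) = 2x\) on \([0, 1]\) and the interval are made up for illustration, and the helper names are mine. It approximates the integral with a midpoint Riemann sum:

```python
def density(x):
    # A simple valid PDF on [0, 1]: f(x) = 2x (non-negative, integrates to 1).
    return 2 * x

def probability(a, b, f, steps=100_000):
    """Approximate P(a <= X <= b) as the integral of f from a to b (midpoint rule)."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) * width for i in range(steps))

# Exact answer for this density: 0.5**2 - 0.2**2 = 0.21
print(probability(0.2, 0.5, density))   # ~0.21
```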

In transformers, we mostly work with discrete distributions (over vocabulary tokens), but continuous distributions appear in weight initialization and analysis.

3.5 The mean (average)

You already know what an average is. Roll a die 6 times, get 2, 5, 3, 6, 1, 4, add them up, divide by 6:

\[ \text{average} = \frac{2 + 5 + 3 + 6 + 1 + 4}{6} = \frac{21}{6} = 3.5 \]

But what if we haven’t rolled yet? We can still compute the theoretical average—what we’d expect to get if we rolled many times. For a fair die, each number 1–6 is equally likely (probability \(\frac{1}{6}\)), so:

\[ \text{mean} = 1 \times \frac{1}{6} + 2 \times \frac{1}{6} + 3 \times \frac{1}{6} + 4 \times \frac{1}{6} + 5 \times \frac{1}{6} + 6 \times \frac{1}{6} = 3.5 \]

Each outcome contributes its value times its probability. This is a weighted average, where the weights are probabilities.

In statistics, this theoretical average is called the expectation or expected value, written \(\mathbb{E}[X]\):

Friendly term Formal term Notation
Mean, average Expectation \(\mathbb{E}[X]\)

The formula:

\[ \text{mean} = \mathbb{E}[X] = \sum_{\text{all outcomes } x} x \times P(x) \]

Or more compactly: \(\mathbb{E}[X] = \sum_x x \cdot p(x)\)

Example: loaded die

Suppose a loaded die has:

  • 50% chance of rolling 6
  • 10% chance each for 1, 2, 3, 4, 5

\[ \text{mean} = 1 \times 0.1 + 2 \times 0.1 + 3 \times 0.1 + 4 \times 0.1 + 5 \times 0.1 + 6 \times 0.5 \] \[ = 0.1 + 0.2 + 0.3 + 0.4 + 0.5 + 3.0 = 4.5 \]

The loaded die’s mean (4.5) is higher than the fair die’s mean (3.5) because high outcomes are more likely.

Example: language model

A language model predicts the next word with probabilities. Suppose the possible words and their probabilities are:

Word Probability “Value” (arbitrary score)
cat 0.4 10
dog 0.3 8
bird 0.2 6
fish 0.1 4

Mean score: \(10 \times 0.4 + 8 \times 0.3 + 6 \times 0.2 + 4 \times 0.1 = 4 + 2.4 + 1.2 + 0.4 = 8.0\)
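Because the expectation is just a weighted sum, it is a one-line loop over (value, probability) pairs. A small sketch in plain Python, reusing the fair die, loaded die, and word-score examples above (the helper name `mean` is mine):

```python
def mean(dist):
    """Expectation E[X] = sum over outcomes of value * probability."""
    return sum(value * prob for value, prob in dist.items())

fair_die = {x: 1 / 6 for x in range(1, 7)}
loaded_die = {1: 0.1, 2: 0.1, 3: 0.1, 4: 0.1, 5: 0.1, 6: 0.5}
word_score = {10: 0.4, 8: 0.3, 6: 0.2, 4: 0.1}   # scores for cat/dog/bird/fish

print(mean(fair_die))    # 3.5
print(mean(loaded_die))  # 4.5
print(mean(word_score))  # 8.0
```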

Key property: The mean is linear. If you have two uncertain quantities and add them:

\[ \text{mean of } (aX + bY) = a \times \text{mean of } X + b \times \text{mean of } Y \]

In formal notation: \(\mathbb{E}[aX + bY] = a\mathbb{E}[X] + b\mathbb{E}[Y]\)

This works even if \(X\) and \(Y\) are related to each other. Linearity makes means easy to work with.

3.6 Spread (variance)

The mean tells us where the distribution is centered. But two distributions can have the same mean yet look very different:

  • Distribution A: always returns 5
  • Distribution B: returns 0 or 10, each with 50% chance

Both have mean 5. But A always gives exactly 5, while B is all over the place. We need a number that measures this spread—how far outcomes typically fall from the mean.

Friendly term Formal term Notation
Spread Variance \(\text{Var}(X)\)

How should we measure spread?

Attempt 1: average distance from mean

For distribution B, let’s compute how far each outcome is from the mean (5):

  • Outcome 0: distance from mean \(= 0 - 5 = -5\)
  • Outcome 10: distance from mean \(= 10 - 5 = +5\)

Average distance: \(\frac{(-5) + (+5)}{2} = 0\)

That’s wrong! B is clearly spread out, but we got zero. The problem: positive and negative distances cancelled out.

Attempt 2: average of absolute distances

Let’s use absolute values to prevent cancellation:

  • \(|0 - 5| = 5\)
  • \(|10 - 5| = 5\)

Average: \(\frac{5 + 5}{2} = 5\)

This works! But absolute value has a problem for machine learning: it’s not smooth. The function \(|x|\) has a sharp corner at \(x = 0\), which causes issues for gradient descent (we need derivatives, and corners don’t have well-defined derivatives).

Attempt 3: average of squared distances (this is variance)

Instead of absolute value, let’s square the distances:

  • \((0 - 5)^2 = 25\)
  • \((10 - 5)^2 = 25\)

Average: \(\frac{25 + 25}{2} = 25\)

This is the variance. In words:

\[ \text{spread} = \text{average of (distance from mean)}^2 \]

In formal notation:

\[ \text{Var}(X) = \mathbb{E}[(X - \mathbb{E}[X])^2] \]

Which reads as: “the expected value of the squared deviation from the expected value.”

Why squared distances?

  1. No cancellation: Squares are always positive
  2. Smooth function: \(x^2\) has derivatives everywhere—perfect for gradient descent
  3. Punishes big deviations: Being 10 away contributes \(100\), being 2 away contributes only \(4\)
  4. Adds nicely: For independent random variables, \(\text{spread of } (X + Y) = \text{spread of } X + \text{spread of } Y\)

Computing spread step by step

Distribution A (always 5):

Mean = 5. Only outcome is 5.

\[ \text{spread} = (5 - 5)^2 \times 1 = 0 \]

Zero spread—every sample equals the mean.

Distribution B (0 or 10 with equal probability):

Mean = \(0 \times 0.5 + 10 \times 0.5 = 5\)

\[ \text{spread} = (0 - 5)^2 \times 0.5 + (10 - 5)^2 \times 0.5 = 25 \times 0.5 + 25 \times 0.5 = 25 \]

Distribution C (outcomes 4, 5, 6 with equal probability):

Mean = \(\frac{4 + 5 + 6}{3} = 5\)

\[ \text{spread} = (4-5)^2 \times \frac{1}{3} + (5-5)^2 \times \frac{1}{3} + (6-5)^2 \times \frac{1}{3} \] \[ = 1 \times \frac{1}{3} + 0 \times \frac{1}{3} + 1 \times \frac{1}{3} = \frac{2}{3} \approx 0.67 \]

Summary: Three distributions, all with mean 5, but different spreads:

Distribution Outcomes Spread (variance)
A Always 5 0
C 4, 5, or 6 0.67
B 0 or 10 25

A computational shortcut

There’s an equivalent formula that’s often easier to compute:

\[ \text{spread} = \text{Var}(X) = \mathbb{E}[X^2] - (\mathbb{E}[X])^2 \]

In words: “mean of squares” minus “square of mean.”

Let’s verify for distribution B:

  • Mean: \(\mathbb{E}[X] = 5\)
  • \(\mathbb{E}[X^2] = 0^2 \cdot 0.5 + 10^2 \cdot 0.5 = 0 + 50 = 50\)
  • \(\text{Var}(X) = 50 - 5^2 = 50 - 25 = 25\)
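A small sketch in plain Python that checks both formulas agree on distributions A, B, and C above (the helper names are mine):

```python
def mean(dist):
    """E[X] = sum of value * probability."""
    return sum(x * p for x, p in dist.items())

def variance(dist):
    """Definition: E[(X - E[X])^2]."""
    m = mean(dist)
    return sum((x - m) ** 2 * p for x, p in dist.items())

def variance_shortcut(dist):
    """Shortcut: E[X^2] - (E[X])^2."""
    return sum(x ** 2 * p for x, p in dist.items()) - mean(dist) ** 2

A = {5: 1.0}                   # always 5
B = {0: 0.5, 10: 0.5}          # 0 or 10, equal chance
C = {4: 1/3, 5: 1/3, 6: 1/3}   # 4, 5, or 6, equal chance

for name, dist in [("A", A), ("B", B), ("C", C)]:
    print(name, variance(dist), variance_shortcut(dist))
# A 0.0 0.0,  B 25.0 25.0,  C ~0.667 ~0.667
```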

Standard deviation: The problem with variance is that it’s in squared units. If \(X\) is in meters, variance is in meters². To get back to the original units, take the square root:

\[ \text{standard deviation} = \sigma = \sqrt{\text{variance}} \]

For distribution B: \(\sigma = \sqrt{25} = 5\)

Friendly term Formal term Notation
Spread (squared) Variance \(\text{Var}(X)\)
Spread (same units) Standard deviation \(\sigma\) or \(\text{std}(X)\)

Important clarification: Standard deviation is the typical distance from the mean, not a typical value you’d see. Distribution B has outcomes 0 and 10, mean 5, standard deviation 5. You’ll never actually get a 5—but when you get 0 or 10, you’re always exactly 5 away from the mean. That’s what \(\sigma = 5\) is telling you: “values typically land about 5 units away from the mean.”

Why spread matters for neural networks: When we initialize weights, we need to control their spread. Too large → activations explode. Too small → gradients vanish. The famous Xavier and He initialization schemes are all about setting the right spread.

3.7 The bell curve (normal distribution)

The normal distribution (also called Gaussian, or “bell curve”) is the most important continuous distribution. You’ve seen it: most values cluster near the middle, with fewer values as you move away.

\[ f(x) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left(-\frac{(x - \mu)^2}{2\sigma^2}\right) \]

Don’t memorize this formula. What matters:

  • \(\mu\) (mu) = the mean (center of the bell)
  • \(\sigma\) (sigma) = the standard deviation (width of the bell)
  • We write \(X \sim \mathcal{N}(\mu, \sigma^2)\) to mean “\(X\) follows a normal distribution with mean \(\mu\) and variance \(\sigma^2\)”

Why is the bell curve everywhere?

The central limit theorem answers this: if you add up many independent random things, the sum looks like a bell curve—no matter what the individual things look like.

Roll one die: uniform distribution (each number equally likely). Roll 100 dice and sum them: bell curve centered around 350.

This matters for neural networks because each neuron computes a sum:

\[ \text{output} = w_1 x_1 + w_2 x_2 + \cdots + w_n x_n \]

Even if individual weights and inputs have weird distributions, their sum tends toward a bell curve. This is why normal distributions appear constantly in neural network analysis.
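A quick simulation sketch of the dice example above, in plain Python (the sample sizes are arbitrary): summing 100 uniform die rolls already produces a clear bell shape.

```python
import random

random.seed(0)

def sum_of_dice(n):
    """Sum of n independent fair-die rolls."""
    return sum(random.randint(1, 6) for _ in range(n))

samples = [sum_of_dice(100) for _ in range(10_000)]

mean = sum(samples) / len(samples)
std = (sum((s - mean) ** 2 for s in samples) / len(samples)) ** 0.5
# One die has variance 35/12, so 100 dice give std of sqrt(100 * 35/12), about 17.1.
print(mean, std)   # roughly 350 and 17

# Crude text histogram: counts cluster around 350 and fall off symmetrically.
for lo in range(290, 410, 10):
    count = sum(lo <= s < lo + 10 for s in samples)
    print(f"{lo}-{lo + 9} {'#' * (count // 100)}")
```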

The standard normal: \(\mathcal{N}(0, 1)\) has mean 0 and spread 1. Any normal can be converted to standard normal by subtracting the mean and dividing by the standard deviation:

\[ Z = \frac{X - \mu}{\sigma} \]

If \(X \sim \mathcal{N}(\mu, \sigma^2)\), then \(Z \sim \mathcal{N}(0, 1)\).

3.8 The categorical distribution and softmax

In transformers, the output is a distribution over a vocabulary of \(V\) words. This is a categorical distribution: \(V\) possible outcomes with probabilities \(p_1, p_2, \ldots, p_V\) where \(\sum_i p_i = 1\).

How do we produce such a distribution from a neural network? The network outputs \(V\) real numbers (called logits), one for each word. These can be any real numbers—positive, negative, large, small. We need to convert them into valid probabilities.

The softmax function does this:

\[ \text{softmax}(z_i) = \frac{\exp(z_i)}{\sum_{j=1}^V \exp(z_j)} \]

Let’s trace through an example. Suppose \(V = 4\) and the logits are \(\mathbf{z} = [2.0, 1.0, 0.1, -1.0]\).

Step 1: Exponentiate each logit: \[ [\exp(2.0), \exp(1.0), \exp(0.1), \exp(-1.0)] = [7.39, 2.72, 1.11, 0.37] \]

Step 2: Sum them: \(7.39 + 2.72 + 1.11 + 0.37 = 11.59\)

Step 3: Divide each by the sum: \[ \text{softmax}(\mathbf{z}) = [0.638, 0.235, 0.096, 0.032] \]

Let’s verify: \(0.638 + 0.235 + 0.096 + 0.032 = 1.001 \approx 1\)

Notice what softmax does:

  • Larger logits → larger probabilities: the logit 2.0 became probability 0.638 (the largest)
  • Preserves ordering: the ranking of logits equals the ranking of probabilities
  • Exponential amplification: small differences in logits become larger differences in probabilities

The softmax is soft because it gives non-zero probability to every option (unlike a hard max that picks one). This smoothness is crucial for gradient-based learning.
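A minimal implementation sketch of the three steps above, in plain Python. Subtracting the maximum logit before exponentiating is a standard numerical-stability trick; it cancels in the ratio, so the probabilities are unchanged:

```python
import math

def softmax(logits):
    """Turn arbitrary real-valued logits into a probability distribution."""
    m = max(logits)                           # subtract the max for stability;
    exps = [math.exp(z - m) for z in logits]  # it cancels in the ratio below
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1, -1.0])
print([round(p, 3) for p in probs])  # [0.638, 0.235, 0.095, 0.032] (matches the worked example up to rounding)
print(sum(probs))                    # 1.0 up to floating-point error
```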

3.9 Uncertainty (entropy)

How do we measure how “uncertain” or “spread out” a probability distribution is?

Visualizing uncertainty

Imagine a language model predicting the next word. Here are three different predictions:

Word Model A (certain) Model B (uncertain) Model C (leaning)
cat 95% 25% 70%
dog 3% 25% 10%
bird 1% 25% 10%
fish 1% 25% 10%

Model A is confident—it’s pretty sure the answer is “cat.” Model B has no idea—all options equally likely. Model C leans toward “cat” but isn’t sure.

Entropy is a single number that captures this. Low entropy = confident. High entropy = uncertain.

Friendly term Formal term Notation
Uncertainty Entropy \(H\)

Building the formula step by step

The key insight: rare events are surprising.

If I say “the sun rose this morning”—not surprising (it always does). If I say “a meteor hit your car”—very surprising (that almost never happens).

We measure surprise as: \(\text{surprise} = -\log(\text{probability})\)

Probability Surprise Interpretation
1.0 0 Certain things aren’t surprising
0.5 0.7 Coin flip—mildly surprising either way
0.1 2.3 Unlikely things are surprising
0.01 4.6 Rare things are very surprising

Entropy = expected surprise

For each outcome, multiply its probability by its surprise, then add them up:

\[ H = \sum_{\text{outcomes}} \text{probability} \times \text{surprise} = -\sum_x p(x) \log p(x) \]

Computing entropy for our three models

Model A (95% cat, 3% dog, 1% bird, 1% fish):

The likely outcome (cat) has low surprise. The unlikely outcomes have high surprise but low probability, so they contribute little.

\[ H = 0.95 \times 0.05 + 0.03 \times 3.5 + 0.01 \times 4.6 + 0.01 \times 4.6 \approx 0.25 \]

Low entropy—the model is confident.

Model B (25% each):

Every outcome is equally surprising, and equally likely to occur.

\[ H = 4 \times (0.25 \times 1.39) = 1.39 \]

Maximum entropy for 4 outcomes—the model has no idea.

Model C (70% cat, 10% each for the rest):

\[ H = 0.70 \times 0.36 + 3 \times (0.10 \times 2.30) = 0.25 + 0.69 = 0.94 \]

Medium entropy—somewhat confident.
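A minimal sketch in plain Python that reproduces these three entropies (the helper name `entropy` is mine):

```python
import math

def entropy(probs):
    """H = -sum p log p (natural log), skipping zero-probability outcomes."""
    return -sum(p * math.log(p) for p in probs if p > 0)

model_a = [0.95, 0.03, 0.01, 0.01]   # certain
model_b = [0.25, 0.25, 0.25, 0.25]   # no idea
model_c = [0.70, 0.10, 0.10, 0.10]   # leaning

for name, probs in [("A", model_a), ("B", model_b), ("C", model_c)]:
    print(name, round(entropy(probs), 2))
# A 0.25   B 1.39   C 0.94
```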

Summary

Model Entropy Confidence
A (certain) 0.25 High
C (leaning) 0.94 Medium
B (no idea) 1.39 Low (maximum for 4 outcomes)

Why this matters: During training, we want our model to become more confident about correct answers. Watching entropy decrease is one way to see the model learning.

3.10 Cross-entropy: the training loss

This is the most important formula in machine learning. It’s how we measure “how wrong is the model?” and therefore what we minimize during training.

The setup

We’re classifying an image. The true answer is “dog”, but look at the model’s prediction:

Class True Model predicts
cat 0% 70%
dog 100% (correct answer) 10%
bird 0% 15%
fish 0% 5%

The model is wrong—it thinks it’s probably a cat! But how wrong? We need a number.

The idea: how surprised is the model?

The true answer is “dog.” The model assigned 10% to dog.

If you only had a 10% belief in something and it turned out to be true, you’d be pretty surprised!

\[ \text{loss} = -\log(\text{probability model assigned to correct answer}) \]

For our example: \(\text{loss} = -\log(0.10) = 2.30\)

Friendly term Formal term Notation
Loss (how wrong) Cross-entropy \(H(p, q)\)

How loss changes with confidence

Model’s confidence in correct answer Loss Note
1% 4.61 Very wrong—huge loss
10% 2.30
25% 1.39
50% 0.69
70% 0.36
90% 0.11
99% 0.01 Almost certain—tiny loss

Notice two things:

  1. Wrong predictions are heavily penalized: going from 1% to 10% confidence reduces the loss by 2.3
  2. Diminishing returns for high confidence: going from 90% to 99% only reduces the loss by 0.1

This is exactly what we want! The model should focus on getting wrong answers less wrong, rather than making already-good predictions slightly better.

A training story

Watch what happens as the model learns (true answer is “dog”):

Epoch P(dog) Loss
1 20% 1.61
2 40% 0.92
3 60% 0.51
4 80% 0.22
5 90% 0.11

Each epoch, the model assigns more probability to “dog”, and the loss decreases.

The formula

When the true answer is simply “it’s class \(j\)” (one-hot), the formula is just:

\[ \text{loss} = -\log(q_j) \]

where \(q_j\) is the probability the model assigned to the correct class.

The general formula (when multiple answers could be correct with different weights):

\[ H(p, q) = -\sum_x p(x) \log q(x) \]

Why cross-entropy is perfect for training

The gradient of the loss with respect to the logits (the raw scores fed into softmax) tells us how to update each probability (true answer is dog):

Class Predicted Gradient Action
cat 70% +0.70 push DOWN
dog 10% −0.90 push UP (correct answer)
bird 15% +0.15 push DOWN
fish 5% +0.05 push DOWN

The gradient is simply: \(\text{predicted} - \text{true}\)

  • For dog (correct): \(0.10 - 1.00 = -0.90\) → negative → increase this probability
  • For wrong answers: \(\text{predicted} - 0\) → positive → decrease these probabilities

Training automatically moves probability mass from wrong answers to right answers.
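A small sketch in plain Python of the one-hot loss and the predicted-minus-true gradient from the table above (the helper names are mine; as noted, the gradient formula applies to the logits feeding the softmax):

```python
import math

def cross_entropy(p_true, q_pred):
    """H(p, q) = -sum p log q; for a one-hot p this reduces to -log q[correct]."""
    return -sum(p * math.log(q) for p, q in zip(p_true, q_pred) if p > 0)

classes = ["cat", "dog", "bird", "fish"]
p_true = [0.0, 1.0, 0.0, 0.0]       # the true answer is "dog" (one-hot)
q_pred = [0.70, 0.10, 0.15, 0.05]   # the model's prediction from the table

print(round(cross_entropy(p_true, q_pred), 2))   # 2.3  (= -log 0.10)

# Gradient of the loss with respect to the logits (softmax + cross-entropy):
# predicted minus true, exactly as in the table above.
for c, p, q in zip(classes, p_true, q_pred):
    print(c, round(q - p, 2))   # cat 0.7, dog -0.9, bird 0.15, fish 0.05
```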

3.11 KL divergence: the “real” distance between distributions

Cross-entropy measures how wrong our model is—but it includes some “baseline” wrongness that’s unavoidable. KL divergence strips away that baseline to measure the pure difference between two distributions.

The problem with cross-entropy

Recall: cross-entropy = \(-\sum p(x) \log q(x)\), where \(p\) is truth and \(q\) is our model.

But even a perfect model has non-zero cross-entropy! If the true distribution is uncertain (say, 50/50), even predicting exactly 50/50 gives:

\[ H(p, p) = -[0.5 \log(0.5) + 0.5 \log(0.5)] = 0.69 \]

This 0.69 isn’t “wrongness”—it’s the inherent uncertainty in the true distribution. You can’t do better than this.

Entropy of the true distribution

The entropy \(H(p)\) measures how uncertain the true distribution is:

True distribution Entropy \(H(p)\) Interpretation
[1.0, 0, 0, 0] 0 Certain—one right answer
[0.7, 0.1, 0.1, 0.1] 0.94 Mostly certain
[0.25, 0.25, 0.25, 0.25] 1.39 Maximum uncertainty

This is the minimum possible cross-entropy. No model can beat this—it’s the irreducible uncertainty in the problem itself.

KL divergence = cross-entropy − entropy

\[ D_{KL}(p \| q) = H(p, q) - H(p) = \underbrace{-\sum p(x) \log q(x)}_{\text{cross-entropy}} - \underbrace{(-\sum p(x) \log p(x))}_{\text{entropy}} \]

Simplifying:

\[ D_{KL}(p \| q) = \sum p(x) \log \frac{p(x)}{q(x)} \]

Concrete example

True distribution: \(p = [0.7, 0.3]\) (say, 70% chance of “cat”, 30% chance of “dog”)

Model prediction: \(q = [0.5, 0.5]\) (model thinks it’s 50/50)

Step 1: Compute entropy of truth \[ H(p) = -[0.7 \log(0.7) + 0.3 \log(0.3)] = -[0.7 \times (-0.36) + 0.3 \times (-1.20)] \] \[ = 0.25 + 0.36 = 0.61 \]

Step 2: Compute cross-entropy \[ H(p, q) = -[0.7 \log(0.5) + 0.3 \log(0.5)] = -[0.7 \times (-0.69) + 0.3 \times (-0.69)] \] \[ = 0.48 + 0.21 = 0.69 \]

Step 3: Compute KL divergence \[ D_{KL}(p \| q) = H(p, q) - H(p) = 0.69 - 0.61 = 0.08 \]

Interpretation:

  • Cross-entropy (0.69) = total “surprise” when using model \(q\)
  • Entropy (0.61) = unavoidable surprise due to true uncertainty
  • KL divergence (0.08) = extra surprise caused by the model being wrong
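A small sketch in plain Python that reproduces these three numbers (the helper names are mine):

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(p || q) = H(p, q) - H(p)."""
    return cross_entropy(p, q) - entropy(p)

p = [0.7, 0.3]   # true distribution
q = [0.5, 0.5]   # model prediction

print(round(entropy(p), 2))            # 0.61
print(round(cross_entropy(p, q), 2))   # 0.69
print(round(kl_divergence(p, q), 2))   # 0.08
```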

Why we minimize cross-entropy, not KL divergence

Since \(H(p)\) is constant (the true distribution doesn’t change during training), minimizing cross-entropy and minimizing KL divergence are equivalent:

\[ \arg\min_q H(p, q) = \arg\min_q D_{KL}(p \| q) \]

We use cross-entropy in practice because it’s simpler to compute—we don’t need to know \(H(p)\).

Key properties of KL divergence

Property Meaning
\(D_{KL} \geq 0\) Always non-negative
\(D_{KL} = 0\) iff \(p = q\) Zero only when distributions match exactly
Not symmetric \(D_{KL}(p \| q) \neq D_{KL}(q \| p)\) in general

3.12 Conditional probability: updating beliefs with new information

Conditional probability answers: “What do I believe now that I have new information?”

Example: the die behind the screen

You roll a die but can’t see it. What’s the probability it’s even?

  • All possibilities: 1, 2, 3, 4, 5, 6
  • Even numbers: 2, 4, 6 → 3 out of 6 = 50%

Now your friend peeks and says: “It’s at least 4.”

  • Still possible: 4, 5, 6 (only 3 options remain)
  • Even AND at least 4: 4, 6 → 2 out of 3 = 67%

The new information changed your belief from 50% to 67%.

The notation

\(P(A | B)\) means “probability of A, given that B is true”

Read the vertical bar as “given that” or “assuming”

  • \(P(\text{even})\) = 50% (no extra information)
  • \(P(\text{even} | \text{at least 4})\) = 67% (with the hint)

The formula

\[ P(A | B) = \frac{P(\text{both A and B})}{P(B)} \]

For our example:

  • \(P(\text{even AND at least 4}) = P(\{4, 6\}) = 2/6\)
  • \(P(\text{at least 4}) = P(\{4, 5, 6\}) = 3/6\)
  • \(P(\text{even} | \text{at least 4}) = \frac{2/6}{3/6} = \frac{2}{3}\)
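A tiny enumeration sketch in plain Python that checks the die example by counting outcomes (the helper names are mine):

```python
from fractions import Fraction

outcomes = range(1, 7)   # faces of a fair die

def prob(event):
    """P(event) on a fair die: favourable outcomes / total outcomes."""
    return Fraction(sum(1 for x in outcomes if event(x)), len(outcomes))

def cond_prob(event_a, event_b):
    """P(A | B) = P(A and B) / P(B)."""
    return prob(lambda x: event_a(x) and event_b(x)) / prob(event_b)

even = lambda x: x % 2 == 0
at_least_4 = lambda x: x >= 4

print(prob(even))                   # 1/2
print(cond_prob(even, at_least_4))  # 2/3
```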

Why this matters for language models

A language model is ALL about conditional probability. Given some words, what’s the probability of the next word?

Context: “The cat sat on the”

Next word Probability
mat 35%
floor 20%
rug 15%
table 10%
elephant 0.01%

Generating text = chaining conditionals

To compute the probability of a whole sentence, multiply the conditionals:

\[ P(\text{"The cat sat"}) = P(\text{"The"}) \times P(\text{"cat"}|\text{"The"}) \times P(\text{"sat"}|\text{"The cat"}) \]

\[ = 0.05 \times 0.02 \times 0.15 = 0.00015 \]

This is called the chain rule of probability. Transformers generate text exactly this way: predict one token, add it to the context, predict the next token, repeat.
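A minimal sketch in plain Python of the chain rule for this sentence; the individual conditional probabilities are the made-up numbers from the example, not real model outputs:

```python
import math

# Hypothetical conditional probabilities from the example above; a real model
# would produce each factor with a softmax over its vocabulary at that step.
step_probs = [
    0.05,   # P("The")
    0.02,   # P("cat" | "The")
    0.15,   # P("sat" | "The cat")
]

print(math.prod(step_probs))   # ~0.00015

# Products of many small probabilities underflow quickly, so in practice we
# add log-probabilities instead of multiplying probabilities.
log_prob = sum(math.log(p) for p in step_probs)
print(log_prob, math.exp(log_prob))   # ~-8.8 and ~0.00015
```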

3.13 Independence: when information doesn’t help

Two events are independent if knowing one tells you nothing about the other.

Example: independent events (coin flips)

Flip two coins. All four outcomes are equally likely:

Coin 2 = H Coin 2 = T
Coin 1 = H HH (25%) HT (25%)
Coin 1 = T TH (25%) TT (25%)

If I tell you Coin 1 was heads, what’s the probability Coin 2 is heads?

Still 50%. Coin 1 doesn’t affect Coin 2. They’re independent.

\(P(\text{Coin 2 = H} | \text{Coin 1 = H}) = P(\text{Coin 2 = H}) = 50\%\)

Example: dependent events (cards)

A standard deck has 52 cards:

  • Red cards (26): 13 hearts + 13 diamonds
  • Black cards (26): 13 spades + 13 clubs

If I tell you the card is a heart, what’s the probability it’s red?

100%. Hearts are always red. These events are NOT independent.

\(P(\text{red}) = 50\%\), but \(P(\text{red} | \text{heart}) = 100\%\)

The test for independence

Events A and B are independent if and only if:

\[ P(\text{both}) = P(A) \times P(B) \]

Why this formula?

Start from the definition: A and B are independent if knowing B doesn’t change your belief about A:

\[ P(A | B) = P(A) \]

Now recall the formula for conditional probability:

\[ P(A | B) = \frac{P(\text{both A and B})}{P(B)} \]

If A and B are independent, we can substitute \(P(A|B) = P(A)\):

\[ P(A) = \frac{P(\text{both})}{P(B)} \]

Multiply both sides by \(P(B)\):

\[ P(A) \times P(B) = P(\text{both}) \]

That’s the formula! It’s just a rearrangement of “knowing B doesn’t change my belief about A.”

Intuition with coins

Why is \(P(\text{both heads}) = 0.5 \times 0.5\)?

Think of it as: “To get both heads, I need the first coin to land heads (50% chance), AND THEN I need the second coin to land heads (50% chance).”

Since Coin 2 doesn’t care what Coin 1 did, I just multiply the probabilities.

Intuition with cards

Why is \(P(\text{red AND heart}) \neq P(\text{red}) \times P(\text{heart})\)?

  • \(P(\text{red}) = 0.5\) (26 red cards out of 52)
  • \(P(\text{heart}) = 0.25\) (13 hearts out of 52)
  • If independent: \(P(\text{both}) = 0.5 \times 0.25 = 0.125\)

But actually \(P(\text{red AND heart}) = P(\text{heart}) = 0.25\), because every heart IS red.

The events aren’t independent—knowing “heart” completely determines “red.”

Why this matters for neural networks

When we initialize weights independently, we can predict how variance adds up:

\[ \text{output} = w_1 x_1 + w_2 x_2 + w_3 x_3 + \cdots \]

If weights are independent:

\[ \text{Var}(\text{output}) = \text{Var}(w_1 x_1) + \text{Var}(w_2 x_2) + \text{Var}(w_3 x_3) + \cdots \]

This formula is the foundation of proper weight initialization (Xavier, He, etc.). Without independence, we couldn’t predict how signals scale through the network.
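A quick Monte Carlo sketch in plain Python (sizes are arbitrary; I assume standard-normal weights and inputs, so each product term has variance 1) showing that the variances of independent terms add:

```python
import random

random.seed(0)

n = 100          # number of inputs to the "neuron"
trials = 5_000   # Monte Carlo samples

def sample_output():
    """One draw of output = w1*x1 + ... + wn*xn with independent N(0, 1) weights and inputs."""
    return sum(random.gauss(0, 1) * random.gauss(0, 1) for _ in range(n))

samples = [sample_output() for _ in range(trials)]
mean = sum(samples) / trials
var = sum((s - mean) ** 2 for s in samples) / trials

# Each term w_i * x_i has variance Var(w_i) * Var(x_i) = 1 (both are zero-mean),
# and the terms are independent, so their variances add: Var(output) is about n.
print(var)   # close to 100
```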