1  Linear algebra essentials

Note: Learning objectives

After completing this chapter, you will be able to:

  • Define vectors, vector spaces, and linear independence
  • Compute dot products and interpret them geometrically as similarity measures
  • Perform matrix multiplication and understand it as a linear transformation
  • Calculate eigenvalues and eigenvectors, and understand their geometric meaning
  • Apply these concepts to understand how neural networks transform data

Before we dive into transformers, we need to establish a solid mathematical foundation. Linear algebra is the language of neural networks—every operation in a transformer can be understood as a matrix operation. This chapter reviews the essential concepts we’ll use throughout the book.

1.1 Vectors and vector spaces

A vector is an ordered list of numbers. We write vectors as column matrices and denote them with lowercase bold letters:

\[ \mathbf{v} = \begin{bmatrix} v_1 \\ v_2 \\ \vdots \\ v_n \end{bmatrix} \]

The set of all n-dimensional real vectors forms a vector space \(\mathbb{R}^n\). A vector space must satisfy certain properties. If \(\mathbf{u}, \mathbf{v}, \mathbf{w}\) are vectors and \(a, b\) are scalars, then:

  1. Closure under addition: \(\mathbf{u} + \mathbf{v}\) is also in the space
  2. Closure under scalar multiplication: \(a\mathbf{v}\) is also in the space
  3. Associativity: \((\mathbf{u} + \mathbf{v}) + \mathbf{w} = \mathbf{u} + (\mathbf{v} + \mathbf{w})\)
  4. Commutativity: \(\mathbf{u} + \mathbf{v} = \mathbf{v} + \mathbf{u}\)
  5. Identity: There exists a zero vector \(\mathbf{0}\) such that \(\mathbf{v} + \mathbf{0} = \mathbf{v}\)
  6. Inverse: For every \(\mathbf{v}\) there exists \(-\mathbf{v}\) such that \(\mathbf{v} + (-\mathbf{v}) = \mathbf{0}\)
  7. Distributivity: \(a(\mathbf{u} + \mathbf{v}) = a\mathbf{u} + a\mathbf{v}\) and \((a + b)\mathbf{v} = a\mathbf{v} + b\mathbf{v}\)
  8. Scalar multiplication associativity: \(a(b\mathbf{v}) = (ab)\mathbf{v}\)
  9. Scalar identity: \(1\mathbf{v} = \mathbf{v}\)

Why do we care about these properties? Because they guarantee we can perform algebraic manipulations safely. In transformers, we’ll constantly be adding vectors (combining information) and scaling them (adjusting magnitudes), so we need these operations to behave predictably.

Before we continue, let’s clarify what “linear” means. We’ll encounter this word everywhere: linear combinations, linear independence, linear transformations. The term “linear” captures a fundamental idea: operations that respect scaling and addition. Specifically, a function or operation \(f\) is linear if it satisfies two properties:

  1. Scaling: \(f(a\mathbf{v}) = af(\mathbf{v})\) for any scalar \(a\)
  2. Addition: \(f(\mathbf{u} + \mathbf{v}) = f(\mathbf{u}) + f(\mathbf{v})\)

These can be combined into one property: \(f(a\mathbf{u} + b\mathbf{v}) = af(\mathbf{u}) + bf(\mathbf{v})\). Linear operations are simple in a precise sense. They don’t have interactions or nonlinear terms. If you double the input, you double the output. If you add two inputs, you can process them separately and add the results. This makes linear operations tractable to analyze mathematically, which is why we study them first. Of course, transformers are not purely linear (otherwise they’d be very limited), but understanding the linear parts is essential before we add nonlinearity.

What violates linearity? Consider \(f(x) = x^2\). This fails scaling: \(f(2x) = (2x)^2 = 4x^2\), but \(2f(x) = 2x^2\). The squaring creates an extra factor. Or consider \(f(x) = x + 1\). This fails the zero test: a linear function must map zero to zero (since \(f(0) = f(0 \cdot \mathbf{v}) = 0 \cdot f(\mathbf{v}) = 0\)), but \(f(0) = 1 \neq 0\). Any constant shift breaks linearity.
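If you want to see these failures numerically, here is a minimal sketch in NumPy (the function names `f_square` and `f_shift` are just illustrative):

```python
import numpy as np

def f_square(x):   # f(x) = x^2: fails the scaling test
    return x ** 2

def f_shift(x):    # f(x) = x + 1: fails the zero test
    return x + 1

x = np.array([1.0, 2.0, 3.0])
print(f_square(2 * x))       # [ 4. 16. 36.]
print(2 * f_square(x))       # [ 2.  8. 18.]  -- not equal, so f(x) = x^2 is not linear
print(f_shift(np.zeros(3)))  # [1. 1. 1.]     -- a linear map must send 0 to 0
```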

A linear combination of vectors \(\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k\) is any expression of the form \(c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \cdots + c_k\mathbf{v}_k\) where \(c_1, c_2, \ldots, c_k\) are scalars. We’re mixing the vectors together with different weights. This is called “linear” because the relationship between the coefficients \(c_i\) and the result is linear: if you double all coefficients, you double the result. If you add two linear combinations, you get another linear combination.

A set of vectors \(\{\mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_k\}\) is linearly independent if none of them is redundant. More precisely, no vector in the set can be obtained by mixing together the others. For example, in \(\mathbb{R}^2\), the vectors \(\mathbf{v}_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}\) and \(\mathbf{v}_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}\) are linearly independent because you can’t get \(\mathbf{v}_1\) by scaling \(\mathbf{v}_2\), and vice versa. But \(\mathbf{v}_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}\), \(\mathbf{v}_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix}\), and \(\mathbf{v}_3 = \begin{bmatrix} 2 \\ 3 \end{bmatrix}\) are linearly dependent because \(\mathbf{v}_3 = 2\mathbf{v}_1 + 3\mathbf{v}_2\).

The formal definition captures this idea precisely. The vectors are linearly independent if the only way to make the zero vector from a linear combination is to use all zero coefficients:

\[c_1\mathbf{v}_1 + c_2\mathbf{v}_2 + \cdots + c_k\mathbf{v}_k = \mathbf{0}\]

has \(c_1 = c_2 = \cdots = c_k = 0\) as its only solution. Why does this work? Suppose we could write one vector, say \(\mathbf{v}_1\), as a combination of the others: \(\mathbf{v}_1 = a_2\mathbf{v}_2 + \cdots + a_k\mathbf{v}_k\). Then we could rearrange to get \(\mathbf{v}_1 - a_2\mathbf{v}_2 - \cdots - a_k\mathbf{v}_k = \mathbf{0}\), which is a linear combination equaling zero with nonzero coefficients. So if any vector is redundant (expressible via the others), we can find a nonzero combination that equals zero. The formal definition says this can’t happen.

A basis for a vector space is a linearly independent set of vectors that spans the entire space. The number of vectors in a basis is the dimension of the space. For \(\mathbb{R}^n\), there are infinitely many possible bases. The most common one is the standard basis:

\[ \mathbf{e}_1 = \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \quad \mathbf{e}_2 = \begin{bmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}, \quad \ldots, \quad \mathbf{e}_n = \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix} \]

But we could equally well use any other set of \(n\) linearly independent vectors. For example, in \(\mathbb{R}^2\), the vectors \(\begin{bmatrix} 1 \\ 1 \end{bmatrix}\) and \(\begin{bmatrix} 1 \\ -1 \end{bmatrix}\) form a perfectly valid basis, just a different one from the standard basis.

Every vector \(\mathbf{v} \in \mathbb{R}^n\) can be uniquely written as \(\mathbf{v} = v_1\mathbf{e}_1 + v_2\mathbf{e}_2 + \cdots + v_n\mathbf{e}_n\).

Here’s a crucial insight for transformers: the same vector can be expressed in different bases, giving different coordinate representations of the same underlying information. Let’s work through a concrete example to see exactly what this means. Consider the vector \(\mathbf{v} = \begin{bmatrix} 3 \\ 4 \end{bmatrix}\) in the standard basis. This notation means:

\[ \mathbf{v} = 3\begin{bmatrix} 1 \\ 0 \end{bmatrix} + 4\begin{bmatrix} 0 \\ 1 \end{bmatrix} \]

Now let’s use a different basis: \(\mathbf{b}_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\) and \(\mathbf{b}_2 = \begin{bmatrix} 1 \\ -1 \end{bmatrix}\). These vectors are linearly independent (you can’t get one by scaling the other), so they form a valid basis for \(\mathbb{R}^2\). What are the coordinates of \(\mathbf{v}\) in this new basis? We need to find \(c_1\) and \(c_2\) such that:

\[ c_1\begin{bmatrix} 1 \\ 1 \end{bmatrix} + c_2\begin{bmatrix} 1 \\ -1 \end{bmatrix} = \begin{bmatrix} 3 \\ 4 \end{bmatrix} \]

This gives us two equations: \[ c_1 + c_2 = 3 \] \[ c_1 - c_2 = 4 \]

Adding these equations: \(2c_1 = 7\), so \(c_1 = 7/2\). Subtracting: \(2c_2 = -1\), so \(c_2 = -1/2\). Let’s verify:

\[ \frac{7}{2}\begin{bmatrix} 1 \\ 1 \end{bmatrix} + \left(-\frac{1}{2}\right)\begin{bmatrix} 1 \\ -1 \end{bmatrix} = \begin{bmatrix} 7/2 \\ 7/2 \end{bmatrix} + \begin{bmatrix} -1/2 \\ 1/2 \end{bmatrix} = \begin{bmatrix} 3 \\ 4 \end{bmatrix} \quad \checkmark \]

The same geometric vector is represented as \(\begin{bmatrix} 3 \\ 4 \end{bmatrix}\) in the standard basis but as \(\begin{bmatrix} 7/2 \\ -1/2 \end{bmatrix}\) in the new basis. The vector itself hasn’t changed (it still points to the same location in space), just the numbers we use to describe it.

Figure 1.1: The same vector \(\mathbf{v}\) in two different bases. Left: Standard basis where \(\mathbf{v} = 3\mathbf{e}_1 + 4\mathbf{e}_2\), coordinates [3, 4]. Right: New basis where \(\mathbf{v} = \frac{7}{2}\mathbf{b}_1 - \frac{1}{2}\mathbf{b}_2\), coordinates [7/2, -1/2]. The thick solid arrow (\(\mathbf{v}\)) points to the same location in both diagrams. Only the coordinate system has changed.

Notice that the thick solid arrow (our vector \(\mathbf{v}\)) points to exactly the same location in both diagrams. In the left diagram, we get there by going 3 steps along \(\mathbf{e}_1\) (horizontal) then 4 steps along \(\mathbf{e}_2\) (vertical). In the right diagram, we get to the same place by going 7/2 steps along \(\mathbf{b}_1\) (diagonal up-right) then -1/2 steps along \(\mathbf{b}_2\) (diagonal down-right). Different paths using different basis vectors, same destination. The coordinates [7/2, -1/2] don’t mean “the point at x=3.5, y=-0.5 in standard coordinates.” They mean “7/2 units along \(\mathbf{b}_1\) and -1/2 units along \(\mathbf{b}_2\)”, which lands at the same spot as [3, 4] in standard coordinates.
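Here is a minimal NumPy sketch of this change of basis (the variable names are illustrative); the new coordinates come from solving the small linear system above:

```python
import numpy as np

v = np.array([3.0, 4.0])           # coordinates of v in the standard basis
B = np.array([[1.0,  1.0],         # columns are the new basis vectors b1 and b2
              [1.0, -1.0]])

c = np.linalg.solve(B, v)          # solve B @ c = v for the new coordinates
print(c)                           # [ 3.5 -0.5]  i.e. [7/2, -1/2]
print(B @ c)                       # [3. 4.]  -- reconstructing v from the new coordinates
```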

Why does this matter? Some bases make certain patterns obvious while others obscure them. Consider a dataset of points arranged in an ellipse. In the standard \(x\)-\(y\) basis, the pattern looks complicated. But if we rotate to a basis aligned with the ellipse’s major and minor axes, the pattern becomes simple: points lie within a certain distance along each axis. We’ve revealed structure by choosing the right basis.

In transformers, different basis representations correspond to different “views” of the same information. When we embed a word as a vector, that vector contains information about the word’s meaning, syntax, context, etc. But this information might not be easily accessible in the original basis. The attention mechanism, as we’ll see, is fundamentally about learning useful basis transformations. It projects vectors into new bases (via query, key, and value matrices) where relationships between words become clear. If we want to know “which words should attend to which,” we need a basis where similar words align and dissimilar words separate. Attention learns these transformations automatically from data. The matrix operations we’ll study aren’t just computational mechanics; they’re geometric transformations that reorganize information to make patterns visible.

1.2 The dot product and geometric intuition

The dot product (or inner product) of two vectors \(\mathbf{u}, \mathbf{v} \in \mathbb{R}^n\) is:

\[ \mathbf{u} \cdot \mathbf{v} = \mathbf{u}^T\mathbf{v} = \sum_{i=1}^n u_i v_i = u_1v_1 + u_2v_2 + \cdots + u_nv_n \]

The dot product has a beautiful geometric interpretation. If \(\theta\) is the angle between \(\mathbf{u}\) and \(\mathbf{v}\), then:

\[ \mathbf{u} \cdot \mathbf{v} = \|\mathbf{u}\| \|\mathbf{v}\| \cos\theta \]

where \(\|\mathbf{v}\| = \sqrt{v_1^2 + v_2^2 + \cdots + v_n^2}\) is the Euclidean norm (length) of \(\mathbf{v}\). This tells us the dot product measures how much two vectors point in the same direction. When \(\theta = 0\) (parallel vectors), \(\cos\theta = 1\) and the dot product is maximized. When \(\theta = 90°\) (orthogonal vectors), \(\cos\theta = 0\) and the dot product is zero. When \(\theta = 180°\) (opposite directions), \(\cos\theta = -1\) and the dot product is minimized.

Let’s prove this geometric interpretation. Consider the law of cosines applied to the triangle formed by vectors \(\mathbf{u}\), \(\mathbf{v}\), and \(\mathbf{u} - \mathbf{v}\):

\[ \|\mathbf{u} - \mathbf{v}\|^2 = \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2 - 2\|\mathbf{u}\|\|\mathbf{v}\|\cos\theta \]

Expanding the left side:

\[ \|\mathbf{u} - \mathbf{v}\|^2 = (\mathbf{u} - \mathbf{v}) \cdot (\mathbf{u} - \mathbf{v}) = \mathbf{u} \cdot \mathbf{u} - 2\mathbf{u} \cdot \mathbf{v} + \mathbf{v} \cdot \mathbf{v} = \|\mathbf{u}\|^2 - 2\mathbf{u} \cdot \mathbf{v} + \|\mathbf{v}\|^2 \]

Equating the two expressions:

\[ \|\mathbf{u}\|^2 - 2\mathbf{u} \cdot \mathbf{v} + \|\mathbf{v}\|^2 = \|\mathbf{u}\|^2 + \|\mathbf{v}\|^2 - 2\|\mathbf{u}\|\|\mathbf{v}\|\cos\theta \]

Simplifying:

\[ -2\mathbf{u} \cdot \mathbf{v} = -2\|\mathbf{u}\|\|\mathbf{v}\|\cos\theta \]

\[ \mathbf{u} \cdot \mathbf{v} = \|\mathbf{u}\|\|\mathbf{v}\|\cos\theta \]

This geometric interpretation is crucial for understanding transformers. We constantly compute dot products between vectors to measure similarity. A large dot product means the vectors are similar (pointing in similar directions), while a small dot product means they’re dissimilar. This is how transformers decide which pieces of information are related and should interact.
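As a quick numerical illustration (a sketch with made-up vectors), the dot product and the cosine of the angle behave exactly as described:

```python
import numpy as np

u = np.array([1.0, 2.0, 3.0])
v = np.array([2.0, 4.0, 6.0])    # same direction as u
w = np.array([-2.0, 1.0, 0.0])   # orthogonal to u

def cos_angle(a, b):
    # cos(theta) = (a . b) / (||a|| ||b||)
    return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

print(np.dot(u, v), cos_angle(u, v))  # 28.0 1.0  -- parallel: maximal similarity
print(np.dot(u, w), cos_angle(u, w))  # 0.0 0.0   -- orthogonal: no similarity
```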

1.3 Matrices and linear transformations

A matrix is a rectangular array of numbers. We write matrices as uppercase bold letters:

\[ \mathbf{A} = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \\ \end{bmatrix} \]

This matrix \(\mathbf{A}\) has \(m\) rows and \(n\) columns, so we say \(\mathbf{A} \in \mathbb{R}^{m \times n}\). Every matrix represents a linear transformation. When we multiply a vector \(\mathbf{x} \in \mathbb{R}^n\) by matrix \(\mathbf{A} \in \mathbb{R}^{m \times n}\), we get a vector \(\mathbf{y} \in \mathbb{R}^m\):

\[ \mathbf{y} = \mathbf{A}\mathbf{x} \]

The \(i\)-th component of \(\mathbf{y}\) is:

\[ y_i = \sum_{j=1}^n a_{ij} x_j \]

This is just the dot product of the \(i\)-th row of \(\mathbf{A}\) with \(\mathbf{x}\). We can think of matrix-vector multiplication in two equivalent ways:

  1. Row perspective: Each element of \(\mathbf{y}\) is a weighted combination of the elements of \(\mathbf{x}\). Let’s see a concrete example:

\[ \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \begin{bmatrix} 7 \\ 8 \\ 9 \end{bmatrix} = \begin{bmatrix} 1 \cdot 7 + 2 \cdot 8 + 3 \cdot 9 \\ 4 \cdot 7 + 5 \cdot 8 + 6 \cdot 9 \end{bmatrix} = \begin{bmatrix} 7 + 16 + 27 \\ 28 + 40 + 54 \end{bmatrix} = \begin{bmatrix} 50 \\ 122 \end{bmatrix} \]

The first component (50) comes from the dot product of the first row \([1, 2, 3]\) with the vector \([7, 8, 9]\). The second component (122) comes from the dot product of the second row \([4, 5, 6]\) with the same vector.

  2. Column perspective: \(\mathbf{y}\) is a linear combination of the columns of \(\mathbf{A}\), with weights given by \(\mathbf{x}\):

\[ \mathbf{y} = x_1 \mathbf{a}_1 + x_2 \mathbf{a}_2 + \cdots + x_n \mathbf{a}_n \]

where \(\mathbf{a}_j\) is the \(j\)-th column of \(\mathbf{A}\). Using the same example:

\[ \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \begin{bmatrix} 7 \\ 8 \\ 9 \end{bmatrix} = 7 \begin{bmatrix} 1 \\ 4 \end{bmatrix} + 8 \begin{bmatrix} 2 \\ 5 \end{bmatrix} + 9 \begin{bmatrix} 3 \\ 6 \end{bmatrix} = \begin{bmatrix} 7 \\ 28 \end{bmatrix} + \begin{bmatrix} 16 \\ 40 \end{bmatrix} + \begin{bmatrix} 27 \\ 54 \end{bmatrix} = \begin{bmatrix} 50 \\ 122 \end{bmatrix} \]

We’re taking 7 copies of the first column, 8 copies of the second column, and 9 copies of the third column, then adding them together. Same result, different interpretation.

Both perspectives are useful. In transformers, the row perspective helps us understand how each output dimension depends on the input. The column perspective helps us see matrix multiplication as mixing together column vectors.
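A short NumPy sketch of the example above, computing the product from both perspectives:

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])
x = np.array([7.0, 8.0, 9.0])

# Row perspective: each output component is a dot product of a row with x
row_view = np.array([A[0] @ x, A[1] @ x])

# Column perspective: a weighted sum of the columns, with weights from x
col_view = x[0] * A[:, 0] + x[1] * A[:, 1] + x[2] * A[:, 2]

print(A @ x, row_view, col_view)   # all three print [ 50. 122.]
```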

When we multiply two matrices \(\mathbf{A} \in \mathbb{R}^{m \times n}\) and \(\mathbf{B} \in \mathbb{R}^{n \times p}\), we get \(\mathbf{C} = \mathbf{AB} \in \mathbb{R}^{m \times p}\):

\[ c_{ij} = \sum_{k=1}^n a_{ik} b_{kj} \]

Matrix multiplication is associative (\(\mathbf{A}(\mathbf{BC}) = (\mathbf{AB})\mathbf{C}\)) and distributive (\(\mathbf{A}(\mathbf{B} + \mathbf{C}) = \mathbf{AB} + \mathbf{AC}\)), but generally not commutative (\(\mathbf{AB} \neq \mathbf{BA}\)). The order matters. For example:

\[ \begin{bmatrix} 1 & 2 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 1 & 0 \end{bmatrix}, \quad \text{but} \quad \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix} \begin{bmatrix} 1 & 2 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 0 & 1 \\ 1 & 2 \end{bmatrix} \]

Different results! This is why we must be careful about order in transformer architectures.
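The same two matrices in NumPy confirm that order matters:

```python
import numpy as np

A = np.array([[1, 2],
              [0, 1]])
B = np.array([[0, 1],
              [1, 0]])

print(A @ B)   # [[2 1]
               #  [1 0]]
print(B @ A)   # [[0 1]
               #  [1 2]]  -- a different matrix: AB != BA in general
```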

The transpose of a matrix \(\mathbf{A}\) is denoted \(\mathbf{A}^T\) and defined by swapping rows and columns: \((A^T)_{ij} = a_{ji}\). Some useful properties:

\[ (\mathbf{A}^T)^T = \mathbf{A}, \quad (\mathbf{AB})^T = \mathbf{B}^T\mathbf{A}^T, \quad (\mathbf{A} + \mathbf{B})^T = \mathbf{A}^T + \mathbf{B}^T \]

A matrix \(\mathbf{A}\) is symmetric if \(\mathbf{A}^T = \mathbf{A}\). Symmetric matrices have special properties we’ll encounter when we study attention patterns.

1.4 Norms and distances

We need a way to measure the “size” or “length” of vectors. In ordinary space, we use the Pythagorean theorem: a vector \(\mathbf{v} = \begin{bmatrix} 3 \\ 4 \end{bmatrix}\) has length \(\sqrt{3^2 + 4^2} = \sqrt{9 + 16} = 5\). But there are other sensible ways to measure size, and different measures are useful in different contexts.

A norm is a function that assigns a non-negative size to each vector. For a function \(\|\cdot\|\) to be a proper norm, it must satisfy three properties:

  1. Positive definiteness: \(\|\mathbf{v}\| \geq 0\) with equality if and only if \(\mathbf{v} = \mathbf{0}\)
  2. Homogeneity: \(\|a\mathbf{v}\| = |a|\|\mathbf{v}\|\) for any scalar \(a\)
  3. Triangle inequality: \(\|\mathbf{u} + \mathbf{v}\| \leq \|\mathbf{u}\| + \|\mathbf{v}\|\)

Let’s understand what these mean. Positive definiteness says only the zero vector has zero size. Homogeneity says if you scale a vector by factor \(a\), its size scales by \(|a|\). The triangle inequality says the direct path from origin to \(\mathbf{u} + \mathbf{v}\) is never longer than going via \(\mathbf{u}\) first. This generalizes the geometric fact that a straight line is the shortest path.

What would violate these properties? Consider \(f(\mathbf{v}) = \|\mathbf{v}\|_2 + 1\). This seems like a reasonable “size” measure, but it fails homogeneity: \(f(2\mathbf{v}) = 2\|\mathbf{v}\|_2 + 1\), while \(2f(\mathbf{v}) = 2\|\mathbf{v}\|_2 + 2\). The constant term breaks scaling. Or consider \(g(\mathbf{v}) = \sum_i v_i\) (sum without absolute values). For \(\mathbf{v} = \begin{bmatrix} 1 \\ -1 \end{bmatrix}\), we get \(g(\mathbf{v}) = 0\) despite \(\mathbf{v} \neq \mathbf{0}\), violating positive definiteness.

1.4.1 Common norms

The most common norms form a family called \(p\)-norms:

\[ \|\mathbf{v}\|_p = \left(\sum_{i=1}^n |v_i|^p\right)^{1/p} \]

Different values of \(p\) give different norms:

\(L^1\) norm (\(p = 1\)): \(\|\mathbf{v}\|_1 = \sum_{i=1}^n |v_i|\). This sums the absolute values of components. For \(\mathbf{v} = \begin{bmatrix} 3 \\ -4 \end{bmatrix}\), we get \(\|\mathbf{v}\|_1 = |3| + |-4| = 7\). This is also called the Manhattan distance or taxicab norm, because it measures distance as if you could only travel along axis-aligned streets.

\(L^2\) norm (\(p = 2\)): \(\|\mathbf{v}\|_2 = \sqrt{\sum_{i=1}^n v_i^2}\). This is the Euclidean norm, the “ordinary” geometric length. For \(\mathbf{v} = \begin{bmatrix} 3 \\ -4 \end{bmatrix}\), we get \(\|\mathbf{v}\|_2 = \sqrt{9 + 16} = 5\). This is what we usually mean by “length” in everyday geometry.

\(L^\infty\) norm (\(p = \infty\)): \(\|\mathbf{v}\|_\infty = \max_{i} |v_i|\). This takes the largest absolute component. For \(\mathbf{v} = \begin{bmatrix} 3 \\ -4 \end{bmatrix}\), we get \(\|\mathbf{v}\|_\infty = \max(|3|, |-4|) = 4\). The name comes from taking the limit as \(p \to \infty\) in the \(p\)-norm formula.

Let’s verify these give different values on a concrete example. For \(\mathbf{v} = \begin{bmatrix} 2 \\ 3 \\ -1 \end{bmatrix}\):

\[ \|\mathbf{v}\|_1 = 2 + 3 + 1 = 6, \quad \|\mathbf{v}\|_2 = \sqrt{4 + 9 + 1} = \sqrt{14} \approx 3.74, \quad \|\mathbf{v}\|_\infty = 3 \]

All three are valid norms. In transformers, we most commonly use the \(L^2\) norm because it has nice geometric and optimization properties.
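All three norms are one-liners in NumPy; a quick check on the example vector (the last two lines preview the normalization idea from later in this section):

```python
import numpy as np

v = np.array([2.0, 3.0, -1.0])

print(np.linalg.norm(v, 1))        # 6.0        L1 (Manhattan) norm
print(np.linalg.norm(v, 2))        # 3.7416...  L2 (Euclidean) norm, sqrt(14)
print(np.linalg.norm(v, np.inf))   # 3.0        L-infinity norm (largest absolute component)

v_hat = v / np.linalg.norm(v)      # dividing by the L2 norm gives a unit vector
print(np.linalg.norm(v_hat))       # 1.0
```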

Let’s look more closely at the triangle inequality: \(\|\mathbf{u} + \mathbf{v}\| \leq \|\mathbf{u}\| + \|\mathbf{v}\|\). This says the direct path from origin to \(\mathbf{u} + \mathbf{v}\) is never longer than going via \(\mathbf{u}\) first. Geometrically, it captures the fact that a straight line is the shortest path between two points.

What if this didn’t hold? Imagine a “norm” where \(\|\mathbf{u} + \mathbf{v}\| > \|\mathbf{u}\| + \|\mathbf{v}\|\) for some vectors. Taking a detour would be shorter than going directly. This would break our geometric intuition about distance. For instance, if walking from A to B directly took 10 minutes, but walking from A to C then C to B took only 8 minutes, our notion of “distance” would be broken. You could keep finding shorter and shorter paths by adding more intermediate points, which makes no sense for measuring actual spatial distance. The triangle inequality ensures that distances behave sensibly: the direct route is always shortest (or at worst, tied).

1.4.2 Distance

Once we have a norm, we can define distance between vectors. The distance from \(\mathbf{u}\) to \(\mathbf{v}\) is simply the norm of their difference:

\[ d(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\| \]

Different norms give different distance measures. Using \(L^2\), we get Euclidean distance. Using \(L^1\), we get Manhattan distance. In transformers, distances between embedding vectors indicate semantic similarity: words with similar meanings have embeddings that are close together.

1.4.3 Unit vectors and normalization

A unit vector is one with norm 1. We can convert any nonzero vector \(\mathbf{v}\) into a unit vector pointing in the same direction by normalizing it:

\[ \hat{\mathbf{v}} = \frac{\mathbf{v}}{\|\mathbf{v}\|} \]

Let’s verify this has norm 1: \(\|\hat{\mathbf{v}}\| = \left\|\frac{\mathbf{v}}{\|\mathbf{v}\|}\right\| = \frac{1}{\|\mathbf{v}\|}\|\mathbf{v}\| = 1\) by homogeneity. For example, \(\mathbf{v} = \begin{bmatrix} 3 \\ 4 \end{bmatrix}\) has \(\|\mathbf{v}\|_2 = 5\), so \(\hat{\mathbf{v}} = \begin{bmatrix} 3/5 \\ 4/5 \end{bmatrix}\) is the unit vector in the same direction.

Normalization is ubiquitous in transformers. Layer normalization rescales activation vectors to have controlled statistics (zero mean and unit variance). This stabilizes training by preventing activations from growing too large or shrinking too small. When we compute attention weights, we often normalize vectors before taking dot products to ensure numerical stability.

1.4.4 Matrix norms

We can also define norms for matrices. The simplest is the Frobenius norm, which treats the matrix as a long vector:

\[ \|\mathbf{A}\|_F = \sqrt{\sum_{i,j} a_{ij}^2} \]

For example, \(\left\|\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}\right\|_F = \sqrt{1 + 4 + 9 + 16} = \sqrt{30}\). This measures the overall “size” of all matrix entries.

1.5 Matrix properties and decompositions

1.5.1 Rank

The rank of a matrix \(\mathbf{A}\) is the dimension of the vector space spanned by its columns (equivalently, by its rows). Intuitively, rank measures the number of independent directions in the output. Consider \(\mathbf{A} \in \mathbb{R}^{m \times n}\) as a linear transformation from \(\mathbb{R}^n\) to \(\mathbb{R}^m\). The rank tells us the dimension of the image: how much of the output space can we actually reach?

Let’s see concrete examples. The matrix \(\mathbf{A} = \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}\) has rank 2. Both columns are linearly independent, and we can reach any point in \(\mathbb{R}^2\) by multiplying appropriate vectors. But consider \(\mathbf{B} = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}\). The second column is twice the first: \(\begin{bmatrix} 2 \\ 4 \end{bmatrix} = 2\begin{bmatrix} 1 \\ 2 \end{bmatrix}\).

Why does this mean we only span one dimension? Recall that \(\mathbf{B}\mathbf{x}\) is a linear combination of \(\mathbf{B}\)’s columns using weights from \(\mathbf{x}\). The first column points in direction \(\begin{bmatrix} 1 \\ 2 \end{bmatrix}\), and the second column also points in direction \(\begin{bmatrix} 1 \\ 2 \end{bmatrix}\) (just scaled by 2). Both columns lie on the same line through the origin. Any combination of them must also lie on that same line. We’re mixing together two vectors that both point in the same direction, so the result can only point in that direction too. We can never “escape” the line to reach other parts of the 2D plane. The column space of \(\mathbf{B}\) is just the line along \(\begin{bmatrix} 1 \\ 2 \end{bmatrix}\), which is one-dimensional.

So \(\mathbf{B}\) has rank 1. For any vector \(\mathbf{x} = \begin{bmatrix} x_1 \\ x_2 \end{bmatrix}\):

\[ \mathbf{B}\mathbf{x} = x_1\begin{bmatrix} 1 \\ 2 \end{bmatrix} + x_2\begin{bmatrix} 2 \\ 4 \end{bmatrix} = (x_1 + 2x_2)\begin{bmatrix} 1 \\ 2 \end{bmatrix} \]

All outputs lie on the same line. The transformation collapses the 2D plane onto a 1D line.

What does “information is lost” mean? Consider two different inputs:

\[ \mathbf{x}_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad \mathbf{x}_2 = \begin{bmatrix} -1 \\ 1 \end{bmatrix} \]

Let’s multiply both by \(\mathbf{B}\):

\[ \mathbf{B}\mathbf{x}_1 = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}\begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \end{bmatrix} \]

\[ \mathbf{B}\mathbf{x}_2 = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}\begin{bmatrix} -1 \\ 1 \end{bmatrix} = \begin{bmatrix} -1 + 2 \\ -2 + 4 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \end{bmatrix} \]

The same output! Two completely different inputs map to the same result. If someone gives us the output \(\begin{bmatrix} 1 \\ 2 \end{bmatrix}\), we can’t tell whether it came from \(\mathbf{x}_1\), \(\mathbf{x}_2\), or infinitely many other vectors. The original information about which input we started with is irrecoverably lost. This is why \(\mathbf{B}\) has no inverse: we can’t “undo” the transformation because many inputs collapse to each output.

Geometrically, imagine the 2D plane collapsing onto a line. Every point perpendicular to the line direction gets squashed to zero in that direction. Points at \((1, 0)\), \((0, 0.5)\), \((-1, 1)\), and infinitely many others all land at the same spot on the output line. The dimension perpendicular to \(\begin{bmatrix} 1 \\ 2 \end{bmatrix}\) is completely annihilated.
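We can check both claims, the rank and the collapse of distinct inputs, with a few lines of NumPy (a sketch using the matrix and vectors from above):

```python
import numpy as np

B = np.array([[1.0, 2.0],
              [2.0, 4.0]])

print(np.linalg.matrix_rank(B))   # 1 -- the columns span only a line

x1 = np.array([1.0, 0.0])
x2 = np.array([-1.0, 1.0])
print(B @ x1)                     # [1. 2.]
print(B @ x2)                     # [1. 2.] -- two different inputs, one output: information is lost
```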

A matrix is full rank if its rank equals the minimum of its dimensions. For a square matrix \(\mathbf{A} \in \mathbb{R}^{n \times n}\), full rank means rank equals \(n\). This is crucial: full rank square matrices are invertible. They don’t lose information, so we can “undo” the transformation. Non-full-rank matrices collapse space and are not invertible.

1.5.2 Determinant

The determinant tells us how a matrix transformation changes areas (in 2D) or volumes (in higher dimensions). For example, apply the matrix \(\begin{bmatrix} 2 & 0 \\ 0 & 3 \end{bmatrix}\) to the unit square. It stretches horizontally by 2 and vertically by 3, producing a \(2 \times 3\) rectangle with area 6. The determinant is \(2 \cdot 3 - 0 \cdot 0 = 6\), exactly the area scaling factor. But why does this formula work? Let’s build deep intuition.

The unit square has edges along the standard basis vectors: one edge goes from \((0,0)\) to \((1,0)\) (along \(\mathbf{e}_1\)), another from \((0,0)\) to \((0,1)\) (along \(\mathbf{e}_2\)). When we apply a matrix \(\mathbf{A}\), where do these edges go? The columns of \(\mathbf{A}\) tell us:

\[ \mathbf{A} = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \implies \mathbf{e}_1 \to \begin{bmatrix} a \\ c \end{bmatrix}, \quad \mathbf{e}_2 \to \begin{bmatrix} b \\ d \end{bmatrix} \]

The unit square transforms into a parallelogram whose sides are the columns of \(\mathbf{A}\). So the question “what area does the unit square become?” is equivalent to “what’s the area of the parallelogram spanned by the columns?”

The area of a parallelogram with sides \(\mathbf{u}\) and \(\mathbf{v}\) is: base times height, where height is the perpendicular distance. If \(\mathbf{u} = [a, c]^T\) and \(\mathbf{v} = [b, d]^T\):

  • Take \(\mathbf{u}\) as the base, with length \(\sqrt{a^2 + c^2}\)
  • The height is how far \(\mathbf{v}\) extends perpendicular to \(\mathbf{u}\)
  • After working through the geometry, this equals \(\frac{|ad - bc|}{\sqrt{a^2 + c^2}}\)
  • Multiplying base \(\times\) height: \(\sqrt{a^2 + c^2} \cdot \frac{|ad - bc|}{\sqrt{a^2 + c^2}} = |ad - bc|\)

So the determinant formula:

\[ \det(\mathbf{A}) = ad - bc \]

is exactly the (signed) area of the parallelogram formed by the columns. The sign captures orientation: positive if the columns maintain the same rotational order as the original basis (counterclockwise), negative if they flip it.

Here’s another way to see it intuitively. The determinant measures how much “independent spreading” the columns do. If both columns point in similar directions, they don’t spread much, so the parallelogram is thin (small area). If they point in perpendicular directions, they spread maximally (large area). If they point in exactly the same direction, they don’t spread at all (zero area, the parallelogram collapses to a line).

Let’s see this concretely. Consider \(\mathbf{A} = \begin{bmatrix} 3 & 1 \\ 1 & 2 \end{bmatrix}\). The corners of the unit square transform as follows:

\[\begin{align} (0,0) &\to \mathbf{A}\begin{bmatrix} 0 \\ 0 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \\ (1,0) &\to \mathbf{A}\begin{bmatrix} 1 \\ 0 \end{bmatrix} = \begin{bmatrix} 3 \\ 1 \end{bmatrix} \\ (0,1) &\to \mathbf{A}\begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 1 \\ 2 \end{bmatrix} \\ (1,1) &\to \mathbf{A}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 4 \\ 3 \end{bmatrix} \end{align}\]

The unit square becomes a parallelogram with corners at \((0,0)\), \((3,1)\), \((1,2)\), \((4,3)\). The area of a parallelogram with sides \(\mathbf{u}\) and \(\mathbf{v}\) is \(|\mathbf{u} \times \mathbf{v}|\) (the magnitude of the cross product). For our parallelogram with sides \([3,1]^T\) and \([1,2]^T\):

\[ \text{area} = |3 \cdot 2 - 1 \cdot 1| = |6 - 1| = 5 \]

Now compute the determinant:

\[ \det(\mathbf{A}) = 3 \cdot 2 - 1 \cdot 1 = 6 - 1 = 5 \]

They match! The determinant directly gives us the area scaling factor. The original square had area 1, the transformed parallelogram has area 5, so \(\mathbf{A}\) scales areas by a factor of 5.

This extends to any shape, not just the unit square. If we have a triangle with area 10, applying \(\mathbf{A}\) transforms it to a new triangle with area \(10 \times 5 = 50\). The determinant is the universal area scaling factor.

Now consider our rank-deficient matrix from earlier: \(\mathbf{B} = \begin{bmatrix} 1 & 2 \\ 2 & 4 \end{bmatrix}\). Let’s see what happens to the unit square:

\[\begin{align} (0,0) &\to \begin{bmatrix} 0 \\ 0 \end{bmatrix} \\ (1,0) &\to \begin{bmatrix} 1 \\ 2 \end{bmatrix} \\ (0,1) &\to \begin{bmatrix} 2 \\ 4 \end{bmatrix} = 2\begin{bmatrix} 1 \\ 2 \end{bmatrix} \\ (1,1) &\to \begin{bmatrix} 3 \\ 6 \end{bmatrix} = 3\begin{bmatrix} 1 \\ 2 \end{bmatrix} \end{align}\]

All four corners lie on the line through \(\begin{bmatrix} 1 \\ 2 \end{bmatrix}\)! The square collapses to a line segment from \((0,0)\) to \((3,6)\). A line segment has zero area. The determinant:

\[ \det(\mathbf{B}) = 1 \cdot 4 - 2 \cdot 2 = 4 - 4 = 0 \]

Zero! This makes perfect sense. When a transformation collapses 2D shapes onto a 1D line, all areas become zero. In general, \(\det(\mathbf{A}) = 0\) if and only if \(\mathbf{A}\) is not full rank. The transformation loses a dimension, squashing everything flat.

When \(\det(\mathbf{A}) \neq 0\), areas are scaled but not annihilated. This means the transformation doesn’t collapse space, so it’s invertible. We can undo it. When \(\det(\mathbf{A}) = 0\), areas become zero, dimensions collapse, and we can’t reverse the process.
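A quick numerical check of the three determinants discussed so far (a minimal sketch; for the rank-deficient matrix, `np.linalg.det` may return zero only up to floating-point error):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
B = np.array([[1.0, 2.0],
              [2.0, 4.0]])
D = np.array([[2.0, 0.0],
              [0.0, 3.0]])

print(np.linalg.det(A))   # ~5.0 -- the unit square becomes a parallelogram of area 5
print(np.linalg.det(B))   # ~0   -- the square collapses onto a line, no inverse exists
print(np.linalg.det(D))   # ~6.0 -- pure axis scaling: 2 * 3
```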

What about the sign of the determinant? Consider \(\mathbf{R} = \begin{bmatrix} 1 & 0 \\ 0 & -1 \end{bmatrix}\):

\[ \det(\mathbf{R}) = 1 \cdot (-1) - 0 \cdot 0 = -1 \]

This matrix flips the \(y\)-axis. The unit square still becomes a parallelogram with area \(|-1| = 1\) (no scaling), but the orientation reverses. Imagine the square has corners labeled counterclockwise as \((0,0)\), \((1,0)\), \((1,1)\), \((0,1)\). After the transformation, the corresponding corners \((0,0)\), \((1,0)\), \((1,-1)\), \((0,-1)\) appear in clockwise order. The sign tells us about orientation: positive preserves it, negative reverses it.

In higher dimensions, the pattern continues. For a \(3 \times 3\) matrix:

\[ \mathbf{A} = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix} \]

The determinant is:

\[ \det(\mathbf{A}) = a(ei - fh) - b(di - fg) + c(dh - eg) \]

This looks complex, but the principle is the same: it measures how much the transformation scales volumes. Let’s compute an example:

\[ \mathbf{A} = \begin{bmatrix} 2 & 0 & 0 \\ 0 & 3 & 0 \\ 0 & 0 & 1 \end{bmatrix} \]

This is a diagonal matrix that scales the \(x\)-axis by 2, \(y\)-axis by 3, and leaves \(z\)-axis unchanged:

\[ \det(\mathbf{A}) = 2(3 \cdot 1 - 0 \cdot 0) - 0(\cdots) + 0(\cdots) = 2 \cdot 3 = 6 \]

The unit cube (volume 1) gets stretched to a box with dimensions \(2 \times 3 \times 1\), which has volume 6. For diagonal matrices, the determinant is simply the product of diagonal entries: \(\det(\mathbf{A}) = 2 \cdot 3 \cdot 1 = 6\). This makes intuitive sense: if you scale each dimension independently, the total volume scales by the product of all scaling factors.

The general principle: \(\det(\mathbf{A})\) is the factor by which volumes get multiplied. Zero determinant means collapse to lower dimension. Negative means orientation reversal. This holds in any dimension.

1.5.3 Eigenvalues and eigenvectors

An eigenvalue of \(\mathbf{A}\) is a scalar \(\lambda\) such that:

\[ \mathbf{A}\mathbf{v} = \lambda\mathbf{v} \]

for some nonzero vector \(\mathbf{v}\), called an eigenvector. This is a profound concept: when we apply \(\mathbf{A}\) to eigenvector \(\mathbf{v}\), the result points in the same direction as \(\mathbf{v}\), just scaled by \(\lambda\). Most vectors get rotated and stretched in complicated ways when multiplied by a matrix. Eigenvectors are special directions that only get scaled.

Why does this matter? Eigenvectors reveal the natural “axes” of a transformation. If we express vectors in the eigenvector basis, matrix multiplication becomes trivial: each component just gets multiplied by its corresponding eigenvalue. This is why eigendecomposition is so powerful.

Let’s find the eigenvalues and eigenvectors of \(\mathbf{A} = \begin{bmatrix} 4 & 1 \\ 2 & 3 \end{bmatrix}\). We need \(\mathbf{A}\mathbf{v} = \lambda\mathbf{v}\), which we can rewrite as:

\[ \mathbf{A}\mathbf{v} - \lambda\mathbf{v} = \mathbf{0} \]

\[ (\mathbf{A} - \lambda\mathbf{I})\mathbf{v} = \mathbf{0} \]

For a nonzero solution \(\mathbf{v}\) to exist, the matrix \(\mathbf{A} - \lambda\mathbf{I}\) must not be invertible, so:

\[ \det(\mathbf{A} - \lambda\mathbf{I}) = 0 \]

For our example:

\[ \det\left(\begin{bmatrix} 4-\lambda & 1 \\ 2 & 3-\lambda \end{bmatrix}\right) = (4-\lambda)(3-\lambda) - 2 = 0 \]

\[ 12 - 4\lambda - 3\lambda + \lambda^2 - 2 = 0 \]

\[ \lambda^2 - 7\lambda + 10 = 0 \]

\[ (\lambda - 5)(\lambda - 2) = 0 \]

The eigenvalues are \(\lambda_1 = 5\) and \(\lambda_2 = 2\). Now let’s find the eigenvectors.

For \(\lambda_1 = 5\):

\[ (\mathbf{A} - 5\mathbf{I})\mathbf{v} = \begin{bmatrix} -1 & 1 \\ 2 & -2 \end{bmatrix}\begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \]

From the first row: \(-v_1 + v_2 = 0\), so \(v_2 = v_1\). One eigenvector is \(\mathbf{v}_1 = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\).

For \(\lambda_2 = 2\):

\[ (\mathbf{A} - 2\mathbf{I})\mathbf{v} = \begin{bmatrix} 2 & 1 \\ 2 & 1 \end{bmatrix}\begin{bmatrix} v_1 \\ v_2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \]

From the first row: \(2v_1 + v_2 = 0\), so \(v_2 = -2v_1\). One eigenvector is \(\mathbf{v}_2 = \begin{bmatrix} 1 \\ -2 \end{bmatrix}\).

Let’s verify:

\[ \mathbf{A}\mathbf{v}_1 = \begin{bmatrix} 4 & 1 \\ 2 & 3 \end{bmatrix}\begin{bmatrix} 1 \\ 1 \end{bmatrix} = \begin{bmatrix} 5 \\ 5 \end{bmatrix} = 5\begin{bmatrix} 1 \\ 1 \end{bmatrix} \quad \checkmark \]

\[ \mathbf{A}\mathbf{v}_2 = \begin{bmatrix} 4 & 1 \\ 2 & 3 \end{bmatrix}\begin{bmatrix} 1 \\ -2 \end{bmatrix} = \begin{bmatrix} 2 \\ -4 \end{bmatrix} = 2\begin{bmatrix} 1 \\ -2 \end{bmatrix} \quad \checkmark \]

Perfect! When we apply \(\mathbf{A}\) to \(\mathbf{v}_1\), it gets scaled by 5. When we apply \(\mathbf{A}\) to \(\mathbf{v}_2\), it gets scaled by 2. These are the natural directions of the transformation.
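NumPy’s `np.linalg.eig` recovers the same eigenpairs (a sketch; note that it returns unit-length eigenvectors, possibly in a different order or with flipped signs):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigvals, eigvecs = np.linalg.eig(A)
print(eigvals)    # [5. 2.] (ordering is not guaranteed)
print(eigvecs)    # columns are eigenvectors, scaled to unit length

# Check A v = lambda v for each eigenpair
for lam, v in zip(eigvals, eigvecs.T):
    print(np.allclose(A @ v, lam * v))   # True, True
```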

Here’s where eigenvectors become powerful: they simplify matrix multiplication. Any vector \(\mathbf{x}\) can be written as a combination of eigenvectors: \(\mathbf{x} = c_1\mathbf{v}_1 + c_2\mathbf{v}_2\). Then:

\[ \mathbf{A}\mathbf{x} = \mathbf{A}(c_1\mathbf{v}_1 + c_2\mathbf{v}_2) = c_1\mathbf{A}\mathbf{v}_1 + c_2\mathbf{A}\mathbf{v}_2 = c_1\lambda_1\mathbf{v}_1 + c_2\lambda_2\mathbf{v}_2 \]

Instead of doing full matrix multiplication, we just multiply each coefficient by its eigenvalue! Let’s try this with \(\mathbf{x} = \begin{bmatrix} 3 \\ 1 \end{bmatrix}\). First, express \(\mathbf{x}\) in the eigenvector basis. We need \(c_1\) and \(c_2\) such that:

\[ c_1\begin{bmatrix} 1 \\ 1 \end{bmatrix} + c_2\begin{bmatrix} 1 \\ -2 \end{bmatrix} = \begin{bmatrix} 3 \\ 1 \end{bmatrix} \]

This gives \(c_1 + c_2 = 3\) and \(c_1 - 2c_2 = 1\). Solving: \(c_1 = 7/3\) and \(c_2 = 2/3\). Now applying \(\mathbf{A}\) is trivial:

\[ \mathbf{A}\mathbf{x} = \frac{7}{3} \cdot 5 \cdot \begin{bmatrix} 1 \\ 1 \end{bmatrix} + \frac{2}{3} \cdot 2 \cdot \begin{bmatrix} 1 \\ -2 \end{bmatrix} = \frac{35}{3}\begin{bmatrix} 1 \\ 1 \end{bmatrix} + \frac{4}{3}\begin{bmatrix} 1 \\ -2 \end{bmatrix} = \begin{bmatrix} 39/3 \\ 27/3 \end{bmatrix} = \begin{bmatrix} 13 \\ 9 \end{bmatrix} \]

Let’s verify by direct multiplication:

\[ \mathbf{A}\mathbf{x} = \begin{bmatrix} 4 & 1 \\ 2 & 3 \end{bmatrix}\begin{bmatrix} 3 \\ 1 \end{bmatrix} = \begin{bmatrix} 12 + 1 \\ 6 + 3 \end{bmatrix} = \begin{bmatrix} 13 \\ 9 \end{bmatrix} \quad \checkmark \]

The eigenvector approach seems longer here, but for repeated multiplications (like \(\mathbf{A}^{100}\mathbf{x}\)), it’s vastly simpler: just raise each eigenvalue to the power. In the eigenvector basis, \(\mathbf{A}^{100}\mathbf{x} = c_1 \cdot 5^{100} \cdot \mathbf{v}_1 + c_2 \cdot 2^{100} \cdot \mathbf{v}_2\). No need to multiply the matrix by itself 100 times.
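Here is a minimal sketch of that trick, using a modest power \(k = 10\) to keep the numbers readable; the eigenvector route agrees with repeated matrix multiplication:

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])
x = np.array([3.0, 1.0])

eigvals, V = np.linalg.eig(A)     # columns of V are eigenvectors: A = V diag(eigvals) V^{-1}
c = np.linalg.solve(V, x)         # coordinates of x in the eigenvector basis

k = 10
via_eig = V @ (eigvals ** k * c)             # scale each coordinate by lambda^k, map back
direct = np.linalg.matrix_power(A, k) @ x    # multiply the matrix k times

print(np.allclose(via_eig, direct))          # True
```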

What if a matrix has no real eigenvalues? Consider a rotation matrix \(\mathbf{R} = \begin{bmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{bmatrix}\) for \(\theta \neq 0, \pi\). This rotates vectors, so no direction stays on the same line. There are no real eigenvectors. The eigenvalues are complex: \(\lambda = \cos\theta \pm i\sin\theta = e^{\pm i\theta}\). Complex eigenvalues indicate rotational behavior.

In transformers, eigenvalues appear in several ways. Weight matrices have eigenvalue spectra that affect gradient flow during training. Large eigenvalues can cause exploding gradients, while very small ones cause vanishing gradients. Normalization techniques (like layer norm) can be understood as controlling these eigenvalue distributions. When we analyze attention patterns, the eigenstructure of attention weight matrices reveals dominant patterns of information flow.

1.5.4 Orthogonal matrices

An orthogonal matrix \(\mathbf{Q}\) satisfies \(\mathbf{Q}^T\mathbf{Q} = \mathbf{I}\). This means the columns of \(\mathbf{Q}\) form an orthonormal basis: each column has norm 1, and different columns are perpendicular. Orthogonal matrices represent pure rotations and reflections without any scaling or skewing.

The key property: orthogonal matrices preserve lengths and angles. For any vectors \(\mathbf{u}, \mathbf{v}\):

\[ \|\mathbf{Q}\mathbf{v}\|_2 = \|\mathbf{v}\|_2, \quad (\mathbf{Q}\mathbf{u}) \cdot (\mathbf{Q}\mathbf{v}) = \mathbf{u} \cdot \mathbf{v} \]

This makes sense geometrically: rotations and reflections don’t change distances or angles, only orientation. The determinant of an orthogonal matrix is always \(\pm 1\): rotations have \(\det(\mathbf{Q}) = 1\), reflections have \(\det(\mathbf{Q}) = -1\). No volume scaling occurs.
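A rotation matrix gives a concrete check of these properties (a sketch; the angle and test vectors are arbitrary):

```python
import numpy as np

theta = 0.7
Q = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])

u = np.array([1.0, 2.0])
v = np.array([3.0, -1.0])

print(np.allclose(Q.T @ Q, np.eye(2)))            # True: Q^T Q = I
print(np.linalg.norm(Q @ v), np.linalg.norm(v))   # same length before and after
print(np.dot(Q @ u, Q @ v), np.dot(u, v))         # same dot product (angles preserved)
print(np.linalg.det(Q))                           # ~1.0: a pure rotation, no volume change
```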

1.6 Intuition: The matrix as a universal lens

We have covered eigenvalues, basis changes, and dot products. Before we move on to calculus, we must pause and internalize a crucial intuition.

In traditional mathematics, we often treat matrices as static tables of data or simple systems of equations. In deep learning we must view matrices differently.

A matrix is a machine. It is a lens.

Every time you see a matrix multiplication \(\mathbf{y} = \mathbf{W}\mathbf{x}\) in a neural network, the matrix \(\mathbf{W}\) is performing a specific physical action on the vector \(\mathbf{x}\). It is not just multiplying numbers; it is transforming information.

Crucially, the nature of this transformation is dictated by the shape of the matrix. Specifically, it depends on how the number of outputs compares to the number of inputs.

1.6.1 The compressor (bottleneck)

The first type of lens appears when a matrix maps a large vector to a smaller one (many inputs to fewer outputs). This forces lossy compression. It is physically impossible to keep all the information from the larger space, so the matrix must make choices about what to keep and what to throw away.

Mathematically, if we have an input \(\mathbf{x} \in \mathbb{R}^{512}\) and a matrix \(\mathbf{W} \in \mathbb{R}^{64 \times 512}\), the output \(\mathbf{y}\) has only 64 dimensions. Let’s look at the operation row by row. If we call the rows of the matrix \(\mathbf{r}_1, \mathbf{r}_2, \dots, \mathbf{r}_{64}\), the multiplication looks like this:

\[ \mathbf{y} = \mathbf{W}\mathbf{x} = \begin{bmatrix} \text{--- } \mathbf{r}_1 \text{ ---} \\ \text{--- } \mathbf{r}_2 \text{ ---} \\ \vdots \\ \text{--- } \mathbf{r}_{64} \text{ ---} \end{bmatrix} \mathbf{x} = \begin{bmatrix} \mathbf{r}_1 \cdot \mathbf{x} \\ \mathbf{r}_2 \cdot \mathbf{x} \\ \vdots \\ \mathbf{r}_{64} \cdot \mathbf{x} \end{bmatrix} \]

Each element \(y_i\) is simply the dot product \(\mathbf{r}_i \cdot \mathbf{x}\). Recall that the dot product is a similarity test. If \(\mathbf{x}\) is perpendicular (orthogonal) to a row \(\mathbf{r}_i\), the result is 0.

Now, imagine a part of vector \(\mathbf{x}\) that points in a direction completely different from all 64 rows. It is orthogonal to every single row vector:

  • \(\mathbf{r}_1 \cdot \mathbf{x} = 0\)
  • \(\mathbf{r}_2 \cdot \mathbf{x} = 0\)
  • …
  • \(\mathbf{r}_{64} \cdot \mathbf{x} = 0\)

The result is the zero vector \(\mathbf{0}\). This is exactly what the null space is: the set of all inputs that get “annihilated” (mapped to zero) because they don’t align with any of the matrix’s feature detectors. The matrix is blind to these directions. It destroys that information completely.
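A toy version of this, shrunk from 512 → 64 down to 3 → 2 so it fits on the page (the specific rows and inputs are made up for illustration):

```python
import numpy as np

# A tiny "compressor": 3 inputs -> 2 outputs
W = np.array([[1.0, 0.0, 0.0],    # row r1: responds only to the first direction
              [0.0, 1.0, 0.0]])   # row r2: responds only to the second direction

x_kept = np.array([2.0, -3.0, 0.0])   # aligned with the rows' directions
x_lost = np.array([0.0, 0.0, 5.0])    # orthogonal to every row: lies in the null space

print(W @ x_kept)   # [ 2. -3.]  -- survives the projection
print(W @ x_lost)   # [0. 0.]    -- annihilated: the matrix is blind to this direction
```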

Imagine describing a complex scene, like a busy city street, to a friend over a noisy phone line. You have 10 seconds. You cannot list every photon of light or every cracked pavement stone. You must prioritize: “Red car. Speeding. Police chasing.”

A matrix that reduces dimension acts as this filter. By reducing the available space, we force the model to identify the essence of the input. It strips away noise, nuance, and irrelevant details. The matrix learns to be a “feature detector” for the most critical patterns. Only inputs matching these patterns survive the projection; everything else is ignored. In practice, we use this to distill a complex object (like a word with many definitions) into a focused representation of just one specific aspect (like its grammatical role).

1.6.2 The expander (unfolding)

The second type of lens appears when a matrix maps a small vector to a larger one (few inputs to many outputs). This creates space for analysis. It allows the model to “unpack” information that was tightly compressed.

Mathematically, consider mapping \(\mathbf{x} \in \mathbb{R}^{512}\) to \(\mathbf{y} \in \mathbb{R}^{2048}\). We define 2048 new direction vectors (the rows). Because we have more outputs than inputs, we are generating an over-complete representation.

Does this create “new” information? No. The output vector \(\mathbf{y}\) lives in a high-dimensional space (\(\mathbb{R}^{2048}\)), but it is constrained to a lower-dimensional subspace (specifically, a 512-dimensional flat sheet called the column space or image of the matrix). You cannot reach every point in the 2048-dimensional space, only those that can be formed by combining the columns of \(\mathbf{W}\).

So why bother? This is similar to adding polynomial features in regression. Imagine you have points on a 1D line that are red-blue-red. You cannot separate them with a single straight cut (a linear classifier). But if you map each point \(x\) to a 2D vector \([x, x^2]\), the points lift onto a parabola. Now, a simple straight line can slice through the parabola to separate red from blue.

The expander matrix performs a similar “lifting” operation. It computes 2048 distinct linear combinations of the original features. It effectively says: “Let’s look at the data from 2048 different angles simultaneously.” By projecting the data onto this higher-dimensional manifold, we increase the probability that complex, entangled patterns will become linearly separable, allowing the subsequent layers (like ReLU) to slice them apart cleanly.
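A minimal sketch of the red-blue-red example above (the specific points and threshold are made up; in a transformer the lift would be a learned linear expansion followed by a nonlinearity such as ReLU, rather than an explicit \(x^2\) feature):

```python
import numpy as np

# 1-D points labeled red-blue-red: no single threshold on x separates them
x = np.array([-2.0, 0.0, 2.0])
labels = ["red", "blue", "red"]

lifted = np.stack([x, x ** 2], axis=1)   # lift each point to [x, x^2] on a parabola

# In the lifted space one linear rule works: second coordinate > 1 means "red"
for point, label in zip(lifted, labels):
    predicted = "red" if point[1] > 1 else "blue"
    print(point, label, predicted)       # the predictions match the labels
```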

Think of a crumpled piece of paper with writing on it. In its compressed ball state, the words touch and overlap so you cannot read them. To understand it, you must unfold it into a larger flat space. Or think of separate ingredients like flour, sugar, and eggs versus a mixed batter. To chemically analyze the batter, you might need to separate the components back out.

A matrix that increases dimension creates this “wiggle room.” By increasing the space, we make it possible to separate complex patterns that were entangled in the lower dimension. We project the data into a high-dimensional space where it is easier to categorize. The matrix generates an “over-complete” representation, computing many distinct combinations of the input features to create a massive menu of potential patterns to look for. In practice, we use this to perform complex logical operations. We expand the data, inspect it in detail, and then compress it back down.

1.6.3 The mixer (rotation/perspective)

The third type of lens appears when the input and output sizes are the same. Since the matrix isn’t compressing or expanding capacity, it is instead translating languages. It acts as a rotation or a change of perspective.

Mathematically, if \(\mathbf{W} \in \mathbb{R}^{512 \times 512}\) is invertible (full rank), it performs a change of basis. The output \(\mathbf{y}\) contains exactly the same amount of information as the input \(\mathbf{x}\), just reorganized. We can write \(\mathbf{y}\) as a linear combination of the columns of \(\mathbf{W}\): \(\mathbf{y} = x_1\mathbf{c}_1 + x_2\mathbf{c}_2 + \dots + x_n\mathbf{c}_n\). The matrix effectively rotates the vector space, aligning the data’s internal axes with the standard axes that the next layer expects. No information is lost (null space is zero), and no extra space is created; the data just “turns” to face a new direction.

Think of holding a map upside down. The information is all there. The distances are correct and the landmarks exist. But it is useless for navigation because the orientation doesn’t match your reality. You need to rotate it.

A square matrix performs this re-orientation. It mixes the independent channels of the vector, essentially saying: “Don’t look at Feature 1 and Feature 2 in isolation; look at their sum and their difference.” It acts as a switchboard or a mixing desk, routing information from where it was computed to where it is needed next. In practice, we use this to integrate information, taking distinct, segregated reports (like “Subject is John” and “Verb is Run”) and mixing them into a single, unified meaning.

1.6.4 Summary

As we progress to neural networks, never look at a weight matrix \(\mathbf{W}\) as just a bag of numbers. Look at its shape.

If it is shrinking the vector, it is a Summarizer forcing the model to decide what matters. If it is growing the vector, it is an Analyzer trying to untangle complex relationships. If it is keeping the size, it is a Translator reorganizing information for the next step.

This “lens” intuition is more powerful than memorizing formulas because it tells you the intent of each component in the architecture.